Home | About | Journals | Submit | Contact Us | Français |

**|**Bioinformatics**|**PMC2881371

Formats

Article sections

- Abstract
- 1 INTRODUCTION
- 2 THE PROBLEM OF TWO-LOCUS EPISTASIS DETECTION IN HUMAN GWAS
- 3 FREE VARIABLES IN THE CONTINGENCY TABLE OF TWO-LOCUS TEST
- 4 BUILDING THE MINIMUM SPANNING TREE ON THE SNPS
- 5 INCREMENTALLY UPDATING OBSERVED FREQUENCY Od2
- 6 THE TEAM ALGORITHM
- 7 EXPERIMENTAL RESULTS
- 8 CONCLUSION AND FUTURE WORK
- REFERENCES

Authors

Related links

Bioinformatics. 2010 June 15; 26(12): i217–i227.

Published online 2010 June 1. doi: 10.1093/bioinformatics/btq186

PMCID: PMC2881371

* To whom correspondence should be addressed.

Copyright © The Author(s) 2010. Published by Oxford University Press.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

This article has been cited by other articles in PMC.

As a promising tool for identifying genetic markers underlying phenotypic differences, genome-wide association study (GWAS) has been extensively investigated in recent years. In GWAS, detecting epistasis (or gene–gene interaction) is preferable over single locus study since many diseases are known to be complex traits. A brute force search is infeasible for epistasis detection in the genome-wide scale because of the intensive computational burden. Existing epistasis detection algorithms are designed for dataset consisting of homozygous markers and small sample size. In human study, however, the genotype may be heterozygous, and number of individuals can be up to thousands. Thus, existing methods are not readily applicable to human datasets. In this article, we propose an efficient algorithm, TEAM, which significantly speeds up epistasis detection for human GWAS. Our algorithm is exhaustive, i.e. it does not ignore any epistatic interaction. Utilizing the minimum spanning tree structure, the algorithm incrementally updates the contingency tables for epistatic tests without scanning all individuals. Our algorithm has broader applicability and is more efficient than existing methods for large sample study. It supports any statistical test that is based on contingency tables, and enables both family-wise error rate and false discovery rate controlling. Extensive experiments show that our algorithm only needs to examine a small portion of the individuals to update the contingency tables, and it achieves at least an order of magnitude speed up over the brute force approach.

**Contact:** ude.cnu.sc@gnaix

Genetic association analysis examines the statistical correlation between an organism's genotype with its phenotype. With the development of high-throughput genotyping technologies, genetic variation of human and other model organisms has been measured at genome-wide scale. As the most abundant source of genetic variation, the number of single nucleotide polymorphism (SNPs) in public databases (dbGaP, JAX) is up to millions. Genome-wide association study (GWAS) has been shown to be a promising tool to locate the genetic factors that cause phenotypic differences (Saxena *et al.*, 2007; Scuteri *et al.*, 2007; WTCCC, 2007; Weedon *et al.*, 2007). Epistasis, or gene–gene interaction detection, has received increasing attention in complex trait analysis. Different from single-locus approach, the goal of two-locus epistasis detection is to identify interacting SNP pairs that have strong association with the phenotype. Please refer to Balding (2006), Hirschhorn and Daly (2005), Hoh and Ott (2003) and Musani *et al.* (2007) for reviews of current progress and challenges in epistasis detection in GWAS.

There are two grand challenges in epistasis detection. The first is to develop statistical tests that can effectively capture the interaction between SNPs. Various tests have been proposed for two-locus association study, such as the chi-square test, likelihood ratio test and entropy-based test (Balding, 2006). Another crucial challenge in two-locus association study is the intensive computational burden imposed by the enormous search space. Suppose that there are *N* SNPs for *M* individuals. The overall search space of pairwise interactions is *MN*(*N* − 1)/2. The large number of tests also causes the multiple testing problem (Miller, 1981). Controlling the family-wise error rate (FWER) and false discovery rate (FDR) are standard ways to control the error rate (Dudoit and Laan, 2008; Westfall and Young, 1993). In the FWER and FDR controlling, permutation test is preferred over simple Bonferroni correction since many SNPs are correlated (Churchill and Doerge, 1994). The correlation structure among genotype profiles is preserved across permutations and is incorporated into permutation *P*-value estimation. The idea of permutation test is to randomly shuffle the phenotype values among the individuals and recalculate the test statistics. The distribution of these test values are used to estimate the null distribution. Permutation test dramatically increases the search space. With *K* permutations, the entire search space of two-locus association mapping is *KMN*(*N* − 1)/2. Consider a moderate GWAS setting, in which *M* = 1000, *N* = 100 000 and *K* = 1000. The size of the search space is about 5 × 10^{15}. Apparently, a brute force enumeration of the search space is infeasible and thus efficient algorithms are in demand.

Although the computational challenge of epistasis detection has been well recognized, the algorithmic development is still very limited. For a small number of SNPs, e.g. from tens to a few hundreds, exhaustive algorithms that explicitly enumerate all possible SNP combinations have been developed (Nelson *et al.*, 2001; Ritchie *et al.*, 2001). These methods are not scalable for genome-wide computing. Genetic algorithm (Carlborg *et al.*, 2000) has been proposed. This approach is heuristic, which does not guarantee to find the optimal solution. To avoid explicitly exploring the entire search space, a common heuristic used in epistasis detection is a two-step approach (Evans *et al.*, 2006; Hoh *et al.*, 2000; Yang *et al.*, 2009). First, a subset of SNPs are selected according to certain criteria. Then the selected SNPs are used for subsequent epistatic analysis. However, the SNP screening process suffers from the same problem as the single-locus approach. SNPs with strong epistasis but low marginal effects are likely to be filtered out (Zhang *et al.*, 2009a).

Recently, the approach based on search space pruning has been shown to be able to dramatically speed up the process of epistasis detection without compromising the optimality of the results. FastANOVA (Zhang *et al.*, 2008) and FastChi (Zhang *et al.*, 2009b) are specifically designed for ANOVA test and chi-square test, respectively. The COE algorithm (Zhang *et al.*, 2009a) is a more general approach that is applicable to all convex tests. Utilizing an upper bound derived for the test being used, these algorithms only need to examine a small number of promising SNP pairs and prune the SNP pairs that are proven to have no strong association with the phenotype. Unlike heuristic approaches, these algorithms are guaranteed to find the optimal solution. Although these methods provide promising alternatives for GWAS, there are two major drawbacks that limit their applicability. First, they are designed for relatively small sample size and only consider homozygous markers (i.e. each SNP can be represented as a {0, 1} binary variable). In human study, however, the sample size is usually large and most SNPs contain heterozygous genotypes and are coded using {0, 1, 2}. These make existing methods intractable. Second, although the FWER and the FDR are both widely used for error controlling, existing methods are designed only to control the FWER. From a computational point of view, the difference in the FWER and the FDR controlling is that, to estimate FWER, for each permutation, only the maximum two-locus test value is needed. To estimate the FDR, on the other hand, for each permutation, all two-locus test values must be computed. Further details of the FWER and the FDR controlling are described in Section 2.

In this article, we propose an exhaustive algorithm, TEAM (Tree-based Epistasis Association Mapping), for efficient epistasis detection in human GWAS. TEAM has several advantages over previous methods.

- It supports to both homozygous and heterozygous data.
- By exhaustively computing all two-locus test values in permutation test, it enables both FWER and FDR controlling.
- It is applicable to all statistics based on the contingency table. Previous methods are either designed for specific tests or require the test statistics to satisfy certain property.
- Experimental results demonstrate that TEAM is more efficient than existing methods for large sample study.

TEAM incorporates permutation test for proper error controlling. The key idea is to incrementally update the contingency tables of two-locus tests. We show that only four of the 18 observed frequencies in the contingency table need to be updated to compute the test value. In the algorithm, we build a minimum spanning tree (Cormen *et al.*, 2001) on the SNPs. The nodes of the tree are SNPs. Each edge represents the genotype difference between the two connected SNPs. This tree structure can be utilized to speed up updating process for the contingency tables. A majority of the individuals are pruned and only a small portion are scanned to update the contingency tables. This is advantageous in human study, which usually involves thousands of individuals. Extensive experimental results demonstrate the efficiency of the TEAM algorithm.

Suppose that the genotype dataset consists of *N* SNPs {*X*_{1},…, *X*_{N}} for *M* individuals {*S*_{1},…, *S*_{M}}. We adopt the convention of using 0 and 2 to represent the homozygous majority and homozygous minority genotypes, respectively, and 1 to represent the heterozygous case. Let *Y*_{0} {0, 1} be the phenotype of interest (0 for controls and 1 for cases). Let *Y*′ = {*Y*_{1},…, *Y*_{K}} be the set of *K* permutations of *Y*_{0}. In each permutation *Y*_{k}, the phenotype labels are randomly reassigned to individuals with no replacement.

Table 1 shows an example dataset of SNPs and phenotype permutations. The genotype dataset consists of six SNPs {*X*_{1},…, *X*_{6}} for 24 individuals {*S*_{1},…, *S*_{24}}. Individuals {*S*_{1},…, *S*_{12}} are cases and {*S*_{13},…, *S*_{24}} are controls. The phenotype is permuted five times, i.e. *Y*′ = {*Y*_{1},…, *Y*_{5}}.

An example dataset consisting of six SNPs {*X*_{1},…, *X*_{6}}, the original phenotype *Y*_{0} and five phenotype permutations {*Y*_{1},…, *Y*_{5}} for 24 individuals {*S*_{1},…, *S*_{24}}

Let denote the statistical test to be used. Specifically, we represent the test value of SNP *X*_{i} and phenotype *Y*_{k} (0≤*k*≤*K*) as (*X*_{i}, *Y*_{k}), and represent the test value of SNP pair (*X*_{i}*X*_{j}) and *Y*_{k} as (*X*_{i}*X*_{j}, *Y*_{k}). A contingency table that records the observed values of certain events, is the basis of many statistical tests. Tables 2–4 show contingency tables for the single-locus tests (*X*_{i}, *Y*_{k}) and (*X*_{j}, *Y*_{k}), genotype relationship between SNPs *X*_{i} and *X*_{j} and two-locus test (*X*_{i}*X*_{j}, *Y*_{k}), respectively.

Due to the large number of hypotheses being tested, multiple testing problem has received considerable attention in GWAS. Controlling the FWER and FDR are two widely used approaches to control the error rate. The FWER is the probability of having at least one false positive. The FDR is the expected proportion of false positives among rejected hypotheses. Permutation test is the standard way to estimate the null distribution in both approaches. Next, we briefly describe the typical procedures of the FWER and FDR control. For statistical background of these approaches, refer to Dudoit and Laan (2008) and Westfall and Young (1993) for details.

*The FWER controlling procedure*: for each permutation *Y*_{k} *Y*′, let _{Yk} represent the maximum test value among all SNP pairs, i.e. _{Yk} = max {(*X*_{i}*X*_{j}, *Y*_{k})|1 ≤ *i* < *j* ≤ *N*}. The distribution of {_{Yk}|*Y*_{k} *Y*′} is used as the null distribution. Given an error rate threshold α, the *critical value* _{α} is the α*K*-th largest value in {_{Yk}|*Y*_{k} *Y*′}. A SNP pair (*X*_{i}*X*_{j}) is considered significant if its test value with the original phenotype *Y*_{0} exceeds the critical value, i.e. _{(}*X*_{i}*X*_{j}, *Y*_{0})≥_{α}.

*The FDR controlling procedure*: let *PV* represent the set of the pooled test values of all permutation tests, i.e. *PV*={(*X*_{i}*X*_{j}, *Y*_{k})|1≤*i*<*j*≤*N*, 1≤*k*≤*K*}. The *P*−value of test (*X*_{i}*X*_{j}, *Y*_{0}) can be calculated as *p*((*X*_{i}*X*_{j}, *Y*_{0}))=|{*t*≥(*X*_{i}*X*_{j}, *Y*_{0})|*t**PV*}|/|*PV*|, i.e. the proportion of the values in *PV* that are no less than (*X*_{i}*X*_{j}, *Y*_{0}). Let *p*_{(1)}≤*p*_{(2)}…≤*p*_{(N(N−1)/2)} be the ordered *P*-values of the original tests. Let . The classic Benjamini–Hochberg method rejects all hypotheses for which the corresponding *P*-values are in the set {*p*_{(1)}, *p*_{(2)},…, *p*_{(v)}}.

In the FWER controlling, we only need the maximum test value of each permutation. To control the FDR, all test values need to be computed to estimate the *P*-value of the original tests. The existing algorithms, such as FastChi (Zhang *et al.*, 2009b) and COE (Zhang *et al.*, 2009a), prune the SNP pairs having weak associations. Thus they cannot be used to control the FDR. Our algorithm, TEAM, exhaustively computes the test values of all SNP pairs for every permutation. It can be used for both the FWER and FDR controlling. In this article, we mainly focus on the problem of permutation test, since it is the most computationally intensive procedure. Testing SNP pairs using original phenotype can be treated as a special case of permutation test.

Let *E*_{event} and *O*_{event} denote the expected frequency and observed frequency of an event in Tables 2–4. Note that each event represents a subset of individuals. For example, event *D* is a subset of individuals satisfying (*X*_{i} = 1 ∧ *Y*_{k} = 1), and *O*_{D} represents its observed frequency, i.e. *O*_{D} = |*D*|. Using the dataset in Table 1, consider *X*_{3} and *Y*_{4} (i.e. *i* = 3 and *k* = 4), we have *D* = {*S*_{10}, *S*_{13}, *S*_{19}}, and *O*_{D} = 3.

Many statistics, such as chi-square test and likelihood ratio test are defined as functions of the observed frequencies in contingency tables. For any test based on the contingency table, to calculate the two-locus test value (*X*_{i}*X*_{j}, *Y*_{k}), one needs all 18 observed frequencies for the events in the two-locus contingency table shown in Table 4. The following theorem shows that we only need four of the 18 values to calculate the two-locus test value given the three contingency tables in Tables 2 and and33.

*For SNPs X*_{i}, *X*_{j} *and permutation Y*_{k}, *given the observed frequencies in* *Tables 2* *and* *3*,*3*, *specifically, the values of* {*O*_{D}, *O*_{F}, *O*_{J}, *O*_{L}, *O*_{O}, *O*_{S}, *O*_{P}, *O*_{V}, *O*_{T}, *O*_{Q}, *O*_{W}, *O*_{R}, *O*_{U}, *O*_{Z}}, *all of the observed frequencies in* *Table 4* *can be determined if the values of* {*O*_{d2}, *O*_{d3}, *O*_{f2}, *O*_{f3}} *are known*.

See Appendix.

Suppose that we have all the single-locus contingency tables, i.e. Table 2. Given a SNP pair (*X*_{i}, *X*_{j}), Table 3 is fixed. Thus, from Theorem 3.1, for permutation *Y*_{k}, once we have the values of {*O*_{d2}, *O*_{d3}, *O*_{f2}, *O*_{f3}}, (*X*_{i}*X*_{j}, *Y*_{k}) can be calculated accordingly. In the following, we show that these values can be computed incrementally utilizing a minimum spanning tree built on SNPs. We focus on the incremental process for *O*_{d2}. The same process can be applied to update *O*_{d3}, *O*_{f2} and *O*_{f3}. We first discuss how to update *O*_{d2} for a specific permutation. Then we show that the procedure can also handle all the permutations in a batch mode.

To build a minimum spanning tree (Cormen *et al.*, 2001) on the SNPs, let the SNPs {*X*_{1}, *X*_{2},…, *X*_{N}} be the nodes and SNP pairs (*X*_{i}*X*_{j}) (*i*≠*j*) be the (undirected) edges. For each edge (*X*_{i}*X*_{j}), we denote its weight (the number of individuals having different genotypes in the two SNPs) as *w*(*X*_{i}*X*_{j}). A *spanning tree* is a tree that spans (connects) all SNPs. Let *V*() be its node set and *E*() be its edge set. A *minimum spanning tree* is a spanning tree whose weight *W*_{}=∑*w*(*X*_{i}*X*_{j}), where (*X*_{i}*X*_{j}) *E*(), is no greater than any other spanning tree. Figure 1 shows the minimum spanning tree built using the example dataset in Table 1. The number on each edge represents its weight. For example, in *X*_{3} and *X*_{2}, there are six individuals, {*S*_{2}, *S*_{8}, *S*_{10}, *S*_{12}, *S*_{15}, *S*_{20}}, having different genotypes.

For any individual, the genotype difference from *X*_{i} to *X*_{j} can be any one of the six combinations, i.e. 0→ 1 (indicating that the genotype in *X*_{i} is 0, and the genotype in *X*_{j} is 1), 1→ 0, 0→ 2, 2→ 0, 1→ 2 and 2→ 1. Using the example dataset in Table 1, Table 5 shows the genotype differences between the connected two SNPs in the minimum spanning tree in Figure 1. We use (*X*_{i}*X*_{j})_{{u→v}}(*u*, *v* {0, 1, 2}) to represent the set of individuals whose genotype in *X*_{i} is *u* and genotype in *X*_{j} is *v*. For example, (*X*_{3}*X*_{2})_{{1→2}}={*S*_{8}, *S*_{10}} and (*X*_{3}*X*_{2})_{{1→2}{0→2}} = {*S*_{2}, *S*_{8}, *S*_{10}}.

In this section, we discuss how to update *O*_{d2} by utilizing the minimum spanning tree. For clarity, from now on, we use *d*_{2}(*X*_{i}*X*_{j}, *Y*_{k}) to denote the specific event *d*_{2} for the SNP pair (*X*_{i}*X*_{j}) and permutation *Y*_{k}, i.e. the subsets of individuals satisfying (*X*_{i} = 1 ∧ *X*_{j} = 1 ∧ *Y*_{k=1}). We use *O*_{d2}(*X*_{i}*X*_{j}, *Y*_{k}) to represent its observed frequency, i.e. *O*_{d2}(*X*_{i}*X*_{j}, *Y*_{k})=|*d*_{2}(*X*_{i}*X*_{j}, *Y*_{k})|. This notation also applies to other events in the contingency tables shown in Tables 2–4. For example, *D*(*X*_{i}, *Y*_{k}) represents the subset of individuals satisfying (*X*_{i} = 1 ∧ *Y*_{k}=1), and *O*_{D}(*X*_{i}, *Y*_{k}) = |*D*(*X*_{i}, *Y*_{k})|.

Next, we show that for any SNP pair (*X*_{i}*X*_{j}) and an edge (*X*_{j}*X*_{j}′) *E*(), given *O*_{d2}(*X*_{i}*X*_{j}, *Y*_{k}), how to update the value for *O*_{d2}(*X*_{i}*X*_{j}′, *Y*_{k}). From the contingency tables in Table 2–4, it is easy to see that

and

The following theorem shows that, given *O*_{d2}(*X*_{i}*X*_{j}, *Y*_{k}) and *D*(*X*_{i}, *Y*_{k}), using the genotype difference associated with edge (*X*_{j}*X*_{j}′), we can get the value of *O*_{d2}(*X*_{i}*X*_{j}′, *Y*_{k}).

*For any SNP pair* (*X*_{i}*X*_{j}) *and an edge* (*X*_{j}*X*_{j}′) *E*(), *we have O*_{d2}(*X*_{i}*X*_{j}′, *Y*_{k}) = *O*_{d2}(*X*_{i}*X*_{j}, *Y*_{k})+|*D*(*X*_{i}, *Y*_{k})∩(*X*_{j}*X*_{j}′)_{{0→1}{2→1}}|− |*D*(*X*_{i}, *Y*_{k})∩(*X*_{j}*X*_{j}′)_{{1→0}{1→2}}|.

See Appendix.

*Using the example dataset in* *Table 1*, *let i*=3, *j*=2, *j*′=5, *and k*=4, *i.e. we consider SNP pair* (*X*_{3}*X*_{2}), *permutation Y*_{4} *and the edge* (*X*_{2}*X*_{5}) *in* *Figure 1*. *Suppose that we already know that O*_{d2}(*X*_{3}*X*_{2}, *Y*_{4})=2, *and event D*(*X*_{3}, *Y*_{4})={*S*_{10}, *S*_{13}, *S*_{19}}. *From* *Table 5*, *we have* (*X*_{2}*X*_{5})_{{0→1}{2→1}}={*S*_{7}, *S*_{8}, *S*_{10}}, *and* (*X*_{2}*X*_{5})_{{1→0}{1→2}}={*S*_{13}}. *Thus according to Theorem 5.1*, *we have O*_{d2}(*X*_{3}*X*_{5}, *Y*_{4})=*O*_{d2}(*X*_{3}*X*_{2}, *Y*_{4})+|{*S*_{10}}|−|{*S*_{13}}|=2. *Note that by this way, we get the value of O*_{d2}(*X*_{3}*X*_{5}, *Y*_{4}) *from O*_{d2}(*X*_{3}*X*_{2}, *Y*_{4})*without scanning all individuals*.

So far, we have discussed the procedure to update the value of *O*_{d2}(*X*_{i}*X*_{j}′, *Y*_{k}) from *O*_{d2}(*X*_{i}*X*_{j}, *Y*_{k}) for a specific phenotype permutation *Y*_{k}. This procedure can be easily extended to handle all the permutations. From Theorem 5.1, for any permutation *Y*_{k}, to update the value of *O*_{d2}(*X*_{i}*X*_{j}′, *Y*_{k}) from *O*_{d2}(*X*_{i}*X*_{j}, *Y*_{k}), we need the value of *D*(*X*_{i}, *Y*_{k}) and the genotype difference associated with edge (*X*_{j}*X*_{j}′). Note that the genotype difference is fixed once the minimum spanning tree is built. Next, we discuss how to compute *D*(*X*_{i}, *Y*_{k}) for all permutations {*Y*_{1}, *Y*_{2},…, *Y*_{K}} in a batch mode in detail.

Let *D*_{K}(*X*_{i}) be a list of *M* entries, with each entry corresponding to an individual. For each individual *S*_{m}, we record in *D*_{K}(*X*_{i})[*m*] the set of phenotypes satisfying (*X*_{i} = 1 ∧ *Y*_{k=1}). For example, consider the dataset in Table 1, we have that *D*_{K}(*X*_{3})[8]={*Y*_{2}, *Y*_{3}}. Table 4 shows the entries of *D*_{K}(*X*_{3}). Only non-empty entries, i.e. *D*_{K}(*X*_{i})[*m*]≠, are shown in the table. It is easy to see that, for any *X*_{i} and *Y*_{k}, we can get *D*(*X*_{i}, *Y*_{k}) from *D*_{K}(*X*_{i}) as follows: *D*(*X*_{i}, *Y*_{k}) is the set of individuals whose corresponding entries in *D*_{K}(*X*_{i}) contain *Y*_{k} as an element, i.e.

(1)

For example, using the example dataset in Table 1, from Table 6, we know that *D*(*X*_{3}, *Y*_{4})={*S*_{10}, *S*_{13}, *S*_{19}}.

For SNP pair (*X*_{i}*X*_{j}), let *O*_{d2}(*X*_{i}*X*_{j})=[*O*_{d2}(*X*_{i}*X*_{j}, *Y*_{1}), *O*_{d2}(*X*_{i}*X*_{j}, *Y*_{2}),…, *O*_{d2}(*X*_{i}*X*_{j}, *Y*_{K})]. From Theorem 5.1 and Equation (1), for any SNP pair (*X*_{i}*X*_{j}) and an edge (*X*_{j}*X*_{j}′)*E*(), we can get *O*_{d2}(*X*_{i}*X*_{j}′) from *O*_{d2}(*X*_{i}*X*_{j}) using *D*_{K}(*X*_{i}) and the genotype difference information associated with edge (*X*_{j}*X*_{j}′). First, initialize *O*_{d2}(*X*_{i}*X*_{j}′)=*O*_{d2}(*X*_{i}*X*_{j}). Next, for every *m* (1≤*m*≤*M*), if *Y*_{k} *D*_{K}(*X*_{i})[*m*], we update *O*_{d2}(*X*_{i}*X*_{j}′) as follows:

*Following Example 5.2*, *we consider the two SNP pairs* (*X*_{3}*X*_{2}) *and* (*X*_{3}*X*_{5}), *with* (*X*_{2}*X*_{5}) *being an edge of the tree in* *Figure 1*. *Assume that D*_{K}(*X*_{3}) *is as shown in* *Table 6*, *and O*_{d2}(*X*_{3}*X*_{2})=[1, 1, 1, 2, 1]. *From* *Table 5*, *the genotype difference on edge* (*X*_{2}*X*_{5}) *is* (*X*_{2}*X*_{5})_{{0→1}{2→1}}={*S*_{7}, *S*_{8}, *S*_{10}}, *and* (*X*_{2}*X*_{5})_{{1→0}{1→2}}={*S*_{13}}. *For individual S*_{m} {*S*_{7}, *S*_{8}, *S*_{10}} (*S*_{m} {*S*_{13}}), *we need to increase (decrease) the corresponding values in O*_{d2}(*X*_{3}*X*_{2}) *according to D*_{K}(*X*_{3}). *Table 7* *shows the updating process for O*_{d2}(*X*_{3}*X*_{5}). *Initially*, *O*_{d2}(*X*_{3}*X*_{5})=*O*_{d2}(*X*_{3}*X*_{2}). *For individual S*_{7}, *since its corresponding entry in D*_{K}(*X*3), *D*_{K}(*X*3)[7]=, *O*_{d2}(*X*_{3}*X*_{5}) *remains unchanged. For individual S*_{8}, *D*_{K}(*X*3)[8]={*Y*_{2}, *Y*_{3}}, *we increase the values of O*_{d2}(*X*_{3}*X*_{5}, *Y*_{2}) *and O*_{d2}(*X*_{3}*X*_{5}, *Y*_{3}) *by 1. Similarly, we increase and decrease the values in O*_{d2}(*X*_{3}*X*_{5}) *according to D*_{K}(*X*3) *for S*_{10} *and S*_{13}. *For individual S*_{19}, *we do not have any update because S*_{19} {*S*_{7}, *S*_{8}, *S*_{10}} *and S*_{19} {*S*_{13}}. *The final result is O*_{d2}(*X*_{3}*X*_{5})=[0, 2, 3, 2, 1].

Note that to get the value of *O*_{d2}(*X*_{i}*X*_{j}), using a brute force approach, we need to scan a (2+*K*) × *M* matrix consisting of the genotype of (*X*_{i}*X*_{j}) and permutations {*Y*_{1}, *Y*_{2},…, *Y*_{K}} for the *M* individuals. In the previous example, to compute the value of *O*_{d2}(*X*_{3}*X*_{5}), the cost of the brute force approach is (3+5) × 24=192. Using our approach, the total number of updates is |*D*_{K}(*X*3)[8]|+|*D*_{K}(*X*3)[10]|+|*D*_{K}(*X*3)[13]|=10, which is significantly less than the cost of the brute force approach. More formally, given *D*_{K}(*X*_{i}), the time complexity of updating *O*_{d2}(*X*_{i}*X*_{j}′) from *O*_{d2}(*X*_{i}*X*_{j}) is *O*(*w*(*X*_{j}*X*_{j}′)*K*).

The procedure of updating *O*_{d2}(*X*_{i}*X*_{j}′) from *O*_{d2}(*X*_{i}*X*_{j}) can also be applied to update the remaining free variables *O*_{d3}(*X*_{i}*X*_{j}), *O*_{f2}(*X*_{i}*X*_{j}) and *O*_{f3}(*X*_{i}*X*_{j}). Note that, to update *O*_{f2}(*X*_{i}*X*_{j}), *O*_{f3}(*X*_{i}*X*_{j}), we will need *F*_{K}(*X*_{i}), which can be defined in a similar way to that of *D*_{K}(*X*_{i}): for each individual *S*_{m}, we record in *F*_{K}(*X*_{i})[*m*] the set of phenotypes satisfying (*X*_{i} = 2 ∧ *Y*_{k}=1).

TEAM examines SNP pairs through a double loop, where the outer loop visits a leaf node at a time, and the inner loop traverse the rest of the tree, starting from the parent node of the leaf. Let *O*_{d2d3f2f3}(*X*_{i}*X*_{j})=[*O*_{d2}(*X*_{i}*X*_{j}), *O*_{d3}(*X*_{i}*X*_{j}), *O*_{f2}(*X*_{i}*X*_{j}), *O*_{f3}(*X*_{i}*X*_{j})]. Let *L*() *V*() be the set of leaf nodes of the minimum spanning tree . For any *leaf* node *X*_{i} *L*(), let *AP*(*X*_{i}) = {(*X*_{i}*X*_{j})|*i*≠*j*, *X*_{j} *V*()}. Let *X*_{a} be the parent node of *X*_{i}. Since all SNPs are connected in , once we have *O*_{d2d3f2f3}(*X*_{i}*X*_{a}), we can update all *O*_{d2}(*X*_{i}*X*_{j}) *AP*(*X*_{i}) by enumerating the edges in *E*() in a breath-first traversal starting from *X*_{a}.

*Consider the tree in* *Figure 1*. *Let X*_{i} = *X*_{3} *and X*_{a} = *X*_{2}. *We have AP*(*X*_{3}) = {(*X*_{3}*X*_{2}), (*X*_{3}*X*_{5}), (*X*_{3}*X*_{6}), (*X*_{3}*X*_{1}), (*X*_{3}*X*_{4})}. *Starting from X*_{3}, *a breadth-first search will enumerate edges* {(*X*_{2}*X*_{5}), (*X*_{5}*X*_{6}), (*X*_{6}*X*_{1}), (*X*_{6}*X*_{4})}, *which can be utilized to update O*_{d2d3f2f3}(*X*_{i}*X*_{j}) *for the SNP pairs in AP*(*X*_{3}).

Once the SNP pairs in *AP*(*X*_{i}) have been processed, we delete *X*_{i} from *L*(), and repeat the same process for another leaf node. The overall algorithm is summarized in Algorithm 1. Given the SNPs *X*′ = {*X*_{1}, *X*_{2},…, *X*_{N}}, phenotype permutations *Y*′ = {*Y*_{1}, *Y*_{2},…, *Y*_{K}}, we first enumerate and store all single-locus contingency tables. We then build the minimum spanning tree , with genotype difference associated with each edge. For leaf node *X*_{i}, we compute *D*_{K}(*X*_{i}), *F*_{K}(*X*_{i}) and *O*_{d2d3f2f3}(*X*_{i}*X*_{a}). This information is then used to incrementally update *O*_{d2d3f2f3} (*X*_{i}*X*_{j}′) for all SNP pairs in *AP*(*X*_{i}). After processing *AP*(*X*_{i}), we delete *X*_{i} from and repeat the procedure for the remaining leaf nodes.

Time complexity: the time complexity on generating all single-locus contingency tables and building the minimum spanning tree is *O*(*MNK*) and *O*(*MN*^{2}), respectively. The time complexity to compute *D*_{K}(*X*_{i}) and *F*_{K}(*X*_{i}) for all SNPs is *O*(*MNK*). The total updating cost for all *AP*(*X*_{i}) is *O*(*W*_{}*NK*). Thus, the overall time complexity of TEAM is *O*(*MNK*+*MN*^{2}+*W*_{}*NK*). Note that the complexity of the brute force approach is *O*(*MN*^{2}*K*). The number of SNPs *N* is the dominant factor.

Space complexity: the dataset size is *O*(*M*(*N*+*K*)). The space needed to store all single-locus contingency tables is *O*(*NK*). The size of tree is *O*(*W*_{}). The size of *D*_{K}(*X*_{i}) and *F*_{K}(*X*_{i}) is *O*(*MK*). Thus, the total space complexity of TEAM is *O*(*M*(*N* + *K*)+*K*(*N*+*M*)+*W*_{}).

Note that we can do incremental computation using any exploration order. TEAM utilizes minimum spanning tree to update the contingency tables. The reason is that the cost of such update depends on the difference between the SNPs. The more similar they are, the lower the cost. Since minimum spanning tree has the minimum weight *W*_{} over all spanning trees, using it to guide the computation leads to optimal efficiency. It is not absolutely necessary to use a minimum spanning tree. As long as the tree is close to a minimum spanning tree, we should expect good performance. An implementation issue in building the minimum spanning tree is that we need *O*(*N*^{2}) space to store all pairwise differences between the SNPs. In practise, we divide the SNPs into sub groups of equal size. A minimum spanning tree is built for each group. Then the sub trees are merged to a larger tree by randomly connecting leave nodes. The tree built in this way is an approximate minimum spanning tree. Our focus in this article is not to build an optimal minimum spanning tree, but to use the tree structure for efficient updating. Refer to Eisner (1997) and Graham and Hell (1985) for surveys on minimum spanning tree construction. In the experiments, we show the performance evaluation using different spanning trees.

In this section, we present extensive experimental results on the performance of the TEAM algorithm. TEAM is implemented in C++. We first evaluate the efficiency of TEAM. Then, we present the findings of epistasis detection in simulated human genome-wide study.

We use both simulated human and real mouse for the efficiency evaluation experiments. The experiments are performed on a 2.6 GHz PC with 8G memory running Linux system.

The human datasets are generated by the simulator Hapsample (Wright *et al.*, 2007), which is publicly accessible from the web site http://www.hapsample.org. We evaluate the performance of TEAM by comparing it with the brute force approach since there is no previous algorithm readily applicable to human datasets. Note that the brute force approach is very time consuming, we use a moderate number of SNPs and permutations in the experiments so that the brute force approach can finish in a reasonable amount of time. Unless otherwise specified, the default experimental setting is the following: #individuals = 400, #SNPs = 10 000, #permutations = 100, and the case/control ratio is 1. These experimental settings are chosen to demonstrate the efficiency gain offered by TEAM over the brute force implementation. TEAM can handle much larger datasets. The performance of TEAM is expected to follow the same trends presented in this section.

TEAM contains three major components: building the minimum spanning tree, updating the contingency tables, and calculating the actual test values. Note that TEAM can be applied to any statistics defined on the contingency table. With different statistics, the only difference in runtime would be caused by the last component calculating the statistics. In the experiments, we choose chi-square test as our statistic. Figure 2 shows the running time comparison of TEAM and the brute force approach using different experimental settings. The *y*-axis is in logarithm scale. In these figures, we also show the detailed runtime of these three components.

Comparison between TEAM and the brute force approach on human datasets under various experimental settings: varying the number of SNPs (**a**), individuals (**b**), permutations (**c**) and varying the case/control ratio (**d**).

Table 8 shows the percentage of individuals pruned by TEAM under different experimental settings. Since in theory we can update the contingency tables in any exploration order, in the table, we also show the pruning effect of using a random spanning tree and a linear spanning tree to guide the updating process. The random spanning tree is generated by starting from a randomly picked SNP and growing edges that connect the remaining SNPs in a random order. The linear tree is a single path connecting all SNPs sequentially. From the table, we can see that TEAM prunes more effectively than the other two updating methods. In the table, we also show the ratio of the tree weights and the size of the SNP dataset, i.e. *W*_{}/(*M* × *N*), which is a determining factor of the pruning ratio. Note that varying the number of permutations and the case/control ratio does not effect the tree being built.

Figures 2a depicts the runtime comparison when varying the number of SNPs. TEAM is more than an order of magnitude faster than the brute force approach. Among the three components of TEAM, the procedures on building the minimum spanning tree and calculating test values only take a small portion of the total runtime of TEAM. The runtime of TEAM is dominated by the cost of updating the contingency tables. As will be shown later, TEAM prunes most of the individuals when updating the contingency tables. In Figures 2b–d, we can also observe a similar one to two orders of magnitude speed up of TEAM over the brute force approach when varying the number of individuals, the number of permutations and the case/control ratio.

The mouse datasets is extracted from a set of combined SNPs from the 10 K GNF (http://www.gnf.org/) mouse dataset and the 140 K Broad/MIT mouse dataset (Wade and Daly, 2005). This merged dataset has 1 56 525 SNPs for 71 mouse strains. The missing values in the dataset are imputed using NPUTE (Roberts *et al.*, 2007). We compare TEAM and the recently proposed COE algorithm (Zhang *et al.*, 2009a), which is specifically designed for association study in mouse datasets. The default experimental setting is as follows: #individuals = 70, #SNPs = 10 000, #permutations = 100, and the case/control ratio is 1.

Figure 3 shows the comparison results. In the figure, we also plot the runtime of the brute force approach. Figure 3a shows the runtime of the three approaches when varying the number of SNPs. It is clear that both TEAM and COE are orders of magnitude faster than the brute force approach. TEAM is about twice faster than COE. Figure 3b shows the runtime comparison when varying the number of individuals. From the figure, COE is more suitable for datasets having small number of individual. As the number of individuals increases, the TEAM algorithm becomes more efficient than COE. Note that in human study, the number of individuals usually ranges up to thousands, much larger than that in typical mouse datasets.

In this section, we report the results of epistasis detection using simulated human GWAS data generated by Hapsample. In total, we generate four datasets, each of which has 112 036 SNPs for 250 cases and 250 controls. In each dataset, a disease causal interacting SNP pair is embedded. The embedded SNP pairs are: (rs768529, rs3804940) in dataset 1, (rs10495728, rs521882) in dataset 2, (rs1016836, rs2783130) in dataset 3 and (rs648519, rs1012273) in dataset 4. We use standard chi-square test with 500 permutations. Similar results can be found by using likelihood ratio test.

With an overall FDR threshold of 0.005, Table 9 shows the identified significant SNP pairs using TEAM. TEAM successfully identified the embedded SNP pairs in all simulated datasets. The embedded SNP pairs are labelled with stars ‘*’. The table shows the SNP loci on the genome. For example, in dataset 1, we embed SNP pair rs768529 and rs3804940, which are located on chromosome 1 at position 51 946 762 bp and chromosome 3 at 7 520 545 bp, respectively. The FWER for each reported SNP pair is also shown. Note that, for a SNP pair, an FDR (or FWER) value of 0 indicates that permutation tests do not generate any test value larger than value of the reported SNP pair. In dataset 1, except for the embedded SNP pair (rs768529, rs3804940), five other SNP pairs are also reported. One of the embedded SNP, rs768529, is involved in all the five pairs. A closer look at the other SNPs in the reported SNP pairs shows that they are all adjacent to the embedded SNP rs3804940. The normalized linkage disequilibrium (Lewontin and Kojima, 1960) between rs3804940 and the other five SNPs are *D*′(rs3804940, rs756084)=1, *D*′(rs3804940, rs779742)= 0.477, *D*′(rs3804940, rs1872393)= 0.442, *D*′(rs3804940, rs779744)= 0.442 and *D*′(rs3804940, rs6764561)= 0.454, indicating there is strong linkage disequilibrium between them.

The large number of SNPs genotyped in the genome-wide scale poses great computational challenges in two-locus epistasis detection. The permutation test used for proper error rate controlling makes the problem computationally even more intensive. In this article, we propose an efficient algorithm, TEAM, for epistasis detection human GWAS. TEAM has the same strength as the recently developed epistasis detection methods, i.e. it guarantees to find the optimal solution. Compared with existing methods, TEAM is more efficient in large sample study, and offers broader applicability. Existing methods designed for homozygous SNPs cannot be used for human data where most SNPs are heterozygous. TEAM, on the other hand, can handle both homozygous and heterozygous SNPs. Since it exhaustively enumerate all SNP pairs, TEAM can be used to control the FWER and FDR, both of which are widely used in controlling error in GWAS; while previous methods only control the FWER. Existing methods need to exam the formulation of the statistic. TEAM is focused on efficiently updating contingency tables rather than any specific statistic. It can, therefore, be used for any statistical test based on contingency table regardless of its formulation.

In this artcile, we focus on the disease phenotypes that can be represented as binary variables. Many association studies involve phenotypes measured as continuous variables. We will investigate how to apply the idea of the current algorithm to quantitative phenotypes in the future study.

*Funding*: National Science Foundation (awards IIS0448392, IIS0812464, partially).

*Conflict of Interest*: none declared.

**Proof of Theorem 3.1**

From the four contingency tables shown in Tables 2–4, it is easy to get the following linear equation system:

The rank of the above linear system is 14. We thus take 14 rows {4, 6, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21}, which form a full rank matrix. The row reduced echelon form of this non-redundant linear system is

Thus, we have the following solution:

Clearly, only four variables {*O*_{d2}, *O*_{d3}, *O*_{f2}, *O*_{f3}} are free. Once the values of these free variables are known, the observed frequencies of remaining events in the two-locus contingency table are also known.

**Proof of Theorem 5.1**

It suffices to show that

This is the same as to show that

This is clearly true, hence completes the proof.

- Balding DJ. A tutorial on statistical methods for population association studies. Nat. Rev. Genet. 2006;7:781–791. [PubMed]
- Carlborg O, et al. The use of a genetic algorithm for simultaneous mapping of multiple interacting quantitative trait loci. Genetics. 2000;155:2003–2010. [PubMed]
- Churchill GA, Doerge RW. Empirical threshold values for quantitative trait mapping. Genetics. 1994;138:963–971. [PubMed]
- Cormen TH, et al. Introduction to Algorithms. MIT Press and McGraw-Hill; 2001.
- Dudoit S, Laan MJ. Multiple Testing Procedures with Applications to Genomics. Springer; 2008.
- Eisner J. State-of-the-art algorithms for minimum spanning trees: a tutorial discussion. Manuscript,University of Pennsylvania; 1997.
- Evans DM, et al. Two-stage two-locus models in genome-wide association. PLoS Genet. 2006;2:e157. [PubMed]
- Graham RL, Hell P. On the history of the minimum spanning tree problem. Ann. History Comput. 1985;7:43–57.
- Hirschhorn JN, Daly MJ. Genome-wide association studies for common diseases and complex traits. Nat. Rev. Genet. 2005;6:95–108. [PubMed]
- Hoh J, Ott J. Mathematical multi-locus approaches to localizing complex human trait genes. Nat. Rev. Genet. 2003;4:701–709. [PubMed]
- Hoh J, et al. Selecting snps in two-stage analysis of disease association data: a model-free approach. Ann. Hum. Genet. 2000;64:413–417. [PubMed]
- Lewontin RC, Kojima K. The evolutionary dynamics of complex polymorphisms. Evolution. 1960;14:458–472.
- Miller RG. Simultaneous Statistical Inference. New York: Springer; 1981.
- Musani SK, et al. Detection of gene x gene interactions in genome-wide association studies of human population data. Hum. Hered. 2007;63:67–84. [PubMed]
- Nelson MR, et al. A combinatorial partitioning method to identify multilocus genotypic partitions that predict quantitative trait variation. Genome Res. 2001;11:458–470. [PubMed]
- Ritchie MD, et al. Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. Am. J. Hum. Genet. 2001;69:138–147. [PubMed]
- Roberts A, et al. Inferring missing genotypes in large snp panels using fast nearest-neighbor searches over sliding windows. Proceeding of ISMB. 2007 [PubMed]
- Saxena R, et al. Genome-wide association analysis identifies loci for type 2 diabetes and triglyceride levels. Science. 2007;316:1331–1336. [PubMed]
- Scuteri A, et al. Genome-wide association scan shows genetic variants in the fto gene are associated with obesity-related traits. PLoS Genet. 2007;3:1200–1210. [PubMed]
- The Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007;447:661–678. [PMC free article] [PubMed]
- Wade CM, Daly MJ. Genetic variation in laboratory mice. Nat. Genet. 2005;37:1175–1180. [PubMed]
- Weedon M, et al. A common variant of hmga2 is associated with adult and childhood height in the general population. Nat. Genet. 2007;39:1245–1250. [PMC free article] [PubMed]
- Westfall PH, Young SS. Resampling-based Multiple Testing. New York: Wiley; 1993.
- Wright FA, et al. Simulating association studies: a data-based resampling method for candidate regions or whole genome scans. Bioinformatics. 2007;23:2581–2588. [PubMed]
- Yang C, et al. SNPHarvester: a filtering-based approach for detecting epistatic interactions in genomewide association studies. Bioinformatics. 2009;25:504–511. [PubMed]
- Zhang X, et al. FastANOVA: an efficient algorithm for genome-wide association study. Proceeding of KDD. 2008 [PMC free article] [PubMed]
- Zhang X, et al. COE: a general approach for efficient genome-wide two-locus epistatic test in disease association study. Proceeding of RECOMB. 2009a
- Zhang X, et al. FastChi: an efficient algorithm for analyzing gene-gene interactions. Proceeding of PSB. 2009b [PMC free article] [PubMed]

Articles from Bioinformatics are provided here courtesy of **Oxford University Press**

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's national science library in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |