Home | About | Journals | Submit | Contact Us | Français |

**|**HHS Author Manuscripts**|**PMC2951741

Formats

Article sections

- Abstract
- 1. INTRODUCTION
- 2. RELATED WORK
- 3. TWO-LOCUS ANOVA TEST
- 4. THE UPPER BOUND
- 4.2 Bounds of ΔA and ΔB
- 5. THE FASTANOVA ALGORITHM
- 6. EXPERIMENTAL RESULTS
- 7. CONCLUSION AND FUTURE WORK
- 9. REFERENCES

Authors

Related links

KDD. Author manuscript; available in PMC 2010 October 8.

Published in final edited form as:

KDD. 2008 : 821–829.

PMCID: PMC2951741

NIHMSID: NIHMS131999

See other articles in PMC that cite the published article.

Studying the association between quantitative phenotype (such as height or weight) and single nucleotide polymorphisms (SNPs) is an important problem in biology. To understand underlying mechanisms of complex phenotypes, it is often necessary to consider joint genetic effects across multiple SNPs. ANOVA (analysis of variance) test is routinely used in association study. Important findings from studying gene-gene (SNP-pair) interactions are appearing in the literature. However, the number of SNPs can be up to millions. Evaluating joint effects of SNPs is a challenging task even for SNP-pairs. Moreover, with large number of SNPs correlated, permutation procedure is preferred over simple Bonferroni correction for properly controlling family-wise error rate and retaining mapping power, which dramatically increases the computational cost of association study.

In this paper, we study the problem of finding SNP-pairs that have significant associations with a given quantitative phenotype. We propose an efficient algorithm, FastANOVA, for performing ANOVA tests on SNP-pairs in a batch mode, which also supports large permutation test. We derive an upper bound of SNP-pair ANOVA test, which can be expressed as the sum of two terms. The first term is based on single-SNP ANOVA test. The second term is based on the SNPs and independent of any phenotype permutation. Furthermore, SNP-pairs can be organized into groups, each of which shares a common upper bound. This allows for maximum reuse of intermediate computation, efficient upper bound estimation, and effective SNP-pair pruning. Consequently, FastANOVA only needs to perform the ANOVA test on a small number of candidate SNP-pairs without the risk of missing any significant ones. Extensive experiments demonstrate that FastANOVA is orders of magnitude faster than the brute-force implementation of ANOVA tests on all SNP pairs.

Quantitative phenotype association study analyzes genetic variation across a population in order to find the genetic factors underlying continuous phenotypes (such as height or weight). These phenotypes are often complex in the sense that they are likely due to the effects of multiple genes [6, 24]. The most abundant source of genetic variation is represented by single nucleotide polymorphisms (SNPs). A SNP is a DNA sequence variation occurring when a single nucleotide (A, T, G, or C) in the genome differs between individuals of a species. For inbred species, a SNP usually shows variation between only two of the four possible nucleotide types [12], which allows us to represent it by a binary variable. The binary representation of a SNP is also referred to as the *genotype* of the SNP.

Various statistics can be applied to measure the association between SNPs and the phenotypes of interest, among which ANOVA (analysis of variance) test is one of the standard statistic methods and has been routinely used in quantitative phenotype association study [18]. The goal of ANOVA test is to determine whether the group means are significantly different after accounting for the variances within groups. It accomplishes the comparison by decomposing the total variance in the data into within-group variance and between-group variance. If the between-group variance is sufficiently larger than the within-group variance, then the test concludes that there is significant (phenotypic) difference between the groups.

In the application of phenotype-SNP association study, the individuals’ phenotype values are grouped by the genotype of a SNP or a subset of SNPs. Figure 1(a) shows an example of strong association between a phenotype and a SNP. 0 and 1 on the x-axis represent the SNP genotype and the y-axis represents the phenotype. Each point in the figure represents an individual. It is clear from the figure that the phenotype values are partitioned into two groups with distinct means, hence indicating a strong association between the phenotype and the SNP. On the other hand, if the genotype of a SNP partitions the phenotype values into groups as shown in Figure 1(b), the phenotype and the SNP are not associated with each other.

Recent advances in high-throughput techniques enable genotyping SNPs in genome-wide scale, resulting in large datasets containing thousands to hundreds of thousands of SNPs [1, 3]. The vast number of SNPs has posed great computational challenge to genome-wide association study. In order to understand the underlying biological mechanisms of complex phenotype, one needs to consider the joint effect of multiple SNPs simultaneously. Although the idea of studying the association between phenotype and multiple SNPs is straightforward, the implementation is nontrivial. For a study with total *N* SNPs, in order to find the association between *n* SNPs and the phenotype, a brute-force approach is to exhaustively enumerate all $\left(\begin{array}{c}\hfill N\hfill \\ \hfill n\hfill \end{array}\right)$ possible SNP combinations and evaluate their associations with the phenotype. The computational burden imposed by this enormous search space often makes the complete genome-wide association study intractable.

The computational challenge of genome-wide association study is further compounded by another well-known statistical problem – the multiple testing problem [14]. The multiple testing problem can be described as the potential increase in Type I error when statistical tests are performed multiple times. Let *α* be the Type I error for each independent test. If *n* independent comparisons are performed, the experimental-wise error *α*′ is given by

$${\alpha}^{\prime}=1-{(1-\alpha )}^{n}.$$

For example, when *α* = 0.05 and *n* = 20, *α*′ = 1 − 0.95^{20} = 0.64. We have 64% probability to get at least one spurious result. Determining the statistical significance of the association between the phenotype and SNPs is crucial. Bonferroni correction based on the assumption that all *n* tests are independent is too conservative for the genome-wise association studies since SNPs are often correlated. Alternatively, permutation procedure can be used and much preferred in association studies which automatically takes the correlation structure of SNPs into consideration.

The idea of permutation is to randomly permute the phenotype hundreds to thousands of times. For each permutated phenotype, the association analysis will be repeated. Then the null distribution of the test statistics is estimated and used to assess the statistical significance of the findings from the original phenotype. However, permutation test is very time-consuming since the test procedure needs to be performed in all permutations in order to find the threshold.

Algorithm development to support these large scale analysis is still in its infancy stage. Most existing work focuses on studying the association between the phenotype and SNP-pairs and can only handle a small number of SNPs. Given a pair of SNPs, the phenotype values can be partitioned into at most four groups by the genotype of the SNP-pair, i.e., 00, 01, 10, and 11. Since each SNP has a distinct location on the genome, the association study of a phenotype and SNP-pairs is also called * two-locus association mapping*. Important findings are appearing in the literature from studying the association between phenotypes and SNP-pairs [21, 22, 27].

Although the standard ANOVA test has been a valuable tool to find association between SNP-pairs and phenotype, it is usually not performed in genome-wide scale. This is due to the fact that the search space of two-locus association mapping in genome-wide scale prohibits an exhaustive search. Suppose that the dataset consists of *N* SNPs and the number of permutations is *K*. The total number of ANOVA tests is *KN*(*N* − 1)/2. Given a moderate number of SNPs *N* = 10, 000 and number of permutations *K* = 1, 000, the number of ANOVA tests is around 5 × 10^{10}. Therefore, ANOVA test is often reserved for validating a small set of candidates identified by other methods [17, 25].

In this paper, we examine the *computational aspect* of ANOVA test. We present an efficient algorithm, FastANOVA, and show that the standard ANOVA test can be applied in genome-wide scale for two-locus association mapping even when the permutation procedure is needed. Unlike algorithms applying heuristics, FastANOVA is a *complete* algorithm, i.e., it guarantees to find the optimal solution, though it does not explicitly examine all possible SNP-pairs. In fact, a large portion of the SNP-pairs are pruned without the need of performing the tests. FastANOVA establishes an upper bound on the two-locus ANOVA test. The upper bound is the sum of two terms: one based on the ANOVA test between phenotype and a single SNP, and the other based on the pair-wise SNP genotype and the ordered phenotype values. This formulation of the upper bound allows the algorithm to calculate the bound for a large number of SNPs together, which enables fast candidate retrieval. Moreover, the intermediate results for calculating the second term of the upper bound is independent of phenotype permutations. Hence they only need to be computed once and can be reused in all permutations. Applying this bound, FastANOVA is able to identify SNP-pairs with significant ANOVA test values using only a small fraction of the time required by performing ANOVA test on all SNP-pairs. The principles developed in FastANOVA are also applicable to the general case of testing SNP subsets containing more than two SNPs.

The problem of phenotype-SNP association study has attracted extensive research interests and is an ongoing research area in biology and statistic communities. In this section, we give a briefly review of the related work from a computational point of view. Please refer to [4, 7, 11] for excellent surveys of existing work.

Under the assumption that the number of SNPs is limited, e.g., from tens to a few hundreds, exhaustive algorithms that explicitly enumerate all possible SNP combinations and evaluate their associations with the phenotype have been developed [16, 19]. These methods are not well adapted to genome-wide association study.

To avoid exhaustively enumerating the search space, a common approach is to break the problem into two steps [8, 10]. First, a subset of important SNPs are selected. Second, within the selected subset, the association between SNPs and the phenotypes are searched. These methods are not complete since the SNPs with weak marginal effects may not be selected in the first place. Genetic algorithm [5, 15] has been applied in finding SNP-pairs for quantitative phenotypes. These methods cannot guarantee to find the optimal solution.

Feature selection methods [13] have been proposed to address the problem of finding important SNPs. In feature selection, the selected feature subset usually contains features that have low correlation with each other but have strong correlation with the target feature. In the application of selecting SNPs, the goal is to select a subset of SNPs that can be used as proxies for all SNPs in the genome [9, 23]. The selected SNPs can then be used as the input SNPs in the association study. Apparently, these methods are also not complete.

Let (*X*_{1}, *X*_{2}, ···, *X*_{N}} be the set of SNPs of *M* individuals (*X*_{i} {0, 1}, 1 ≤ *i* ≤ *N*) and *Y* = {*y*_{1}, *y*_{2}, ···, *y*_{M}} be the quantitative phenotype of interest, where *y*_{m} (1 ≤ *m* ≤ *M*) is the phenotype value of individual *m*.

For any SNP *X*_{i} (1 ≤ *i* ≤ *N*), we represent the F-statistic from the ANOVA test of *X _{i}* and

The basic idea of ANOVA test is to partition the total sum of squared deviations *SS _{T}* into between-group sum of squared deviations

$$S{S}_{T}=S{S}_{B}+S{S}_{W}.$$

In our application of two-locus association study, Table 1(a) and Table 1(b) show the possible groupings of phenotype values by the genotypes of *X _{i}* and (

Let *A*, *B*, *a*_{1}, *a*_{2}, *b*_{1}, *b*_{2} represent the groups as indicated in Table 1(a) and Table 1(b). We use *SS _{B}*(

$$\begin{array}{cc}\hfill & \phantom{\rule{1em}{0ex}}\phantom{\rule{1em}{0ex}}\phantom{\rule{thickmathspace}{0ex}}S{S}_{T}({X}_{i},Y)=S{S}_{B}({X}_{i},Y)+S{S}_{W}({X}_{i},Y),\hfill \\ \hfill & S{S}_{T}({X}_{i}{X}_{j},Y)=S{S}_{B}({X}_{i}{X}_{j},Y)+S{S}_{W}({X}_{i}{X}_{j},Y).\hfill \end{array}$$

The F-statistics for ANOVA tests on *X _{i}* and (

$$F({X}_{i},Y)=\frac{M-2}{2-1}\times \frac{S{S}_{B}({X}_{i},Y)}{S{S}_{T}({X}_{i},Y)-S{S}_{B}({X}_{i},Y)},$$

(1)

$$F({X}_{i}{X}_{j},Y)=\frac{M-g}{g-1}\times \frac{S{S}_{B}({X}_{i}{X}_{j},Y)}{S{S}_{T}({X}_{i}{X}_{j},Y)-S{S}_{B}({X}_{i}{X}_{j},Y)},$$

(2)

where *g* in Equation (2) is the number of groups that the genotype of (*X _{i}X_{j}*) partitions the individuals into. Possible values of

Let $T={\sum}_{{y}_{m}\in Y}{y}_{m}$ be be the sum of all phenotype values. The total sum of squared deviations does not depend on the groupings of individuals:

$$S{S}_{T}({X}_{i},Y)=S{S}_{T}({X}_{i}{X}_{j},Y)=\sum _{{y}_{m}\in Y}{y}_{m}^{2}-\frac{{T}^{2}}{M}.$$

Let ${T}_{group}={\sum}_{{y}_{m}\in group}{y}_{m}$ be the sum of phenotype values in a specific group, and *n _{group}* be the number of individuals in that group.

$$S{S}_{B}({X}_{i},Y)=\frac{{T}_{A}^{2}}{{n}_{A}}+\frac{{T}_{B}^{2}}{{n}_{B}}-\frac{{T}^{2}}{M},$$

$$S{S}_{B}({X}_{i}{X}_{j},Y)=\frac{{T}_{{a}_{1}}^{2}}{{n}_{{a}_{1}}}+\frac{{T}_{{a}_{2}}^{2}}{{n}_{{a}_{2}}}+\frac{{T}_{{b}_{1}}^{2}}{{n}_{{b}_{1}}}+\frac{{T}_{{b}_{2}}^{2}}{{n}_{{b}_{2}}}-\frac{{T}^{2}}{M}.$$

Note that for any group of *A*, *B*, *a*_{1}, *a*_{2}, *b*_{1}, *b*_{2}, if *n _{group}* = 0, then $\frac{{T}_{grouop}^{2}}{{n}_{group}}$ is defined to be 0.

The two-locus association mapping with permutation test is typically conducted in the following way [18].

First, for every SNP-pair (*X _{i}X_{j}*) (1 ≤

Second, a permutation test is performed to get a reference distribution in order to assess the statistical significance of previous findings. More specifically, a permutation *Y _{k}* of

$${F}_{{Y}_{k}}=\text{max}\{F({X}_{i}{X}_{j},{Y}_{k})\mid 1\le i<j\le N\}.$$

The distribution of {*F _{Yk}*|

Two computational problems need to be solved in this procedure. The first one is to find the critical value *F _{α}* for a given Type I error threshold

**Problem (1)**: Given the Type I error threshold *α*, find the critical value *F _{α}*, which is the

**Problem (2)**: Given the threshold *F _{α}*, find all SNP-pairs (

A brute force approach to these two problems is to enumerate all SNP-pairs and find their F-statistics. In Problem (1), for each permutation *Y _{k}*

In the reminder of the paper, we first derive an upper bound on two-locus ANOVA test value and discuss how this upper bound enables an efficient ANOVA testing for a single phenotype. Then we show how this approach can be easily extended to handle the permutation procedure.

Since the total sum of squared deviations does not change, from the calculation of *F*(*X _{i}*,

$$\begin{array}{cc}\hfill \Delta A\phantom{\rule{thickmathspace}{0ex}}& =\phantom{\rule{thickmathspace}{0ex}}\frac{{T}_{{a}_{1}}^{2}}{{n}_{{a}_{1}}}+\frac{{T}_{{a}_{2}}^{2}}{{n}_{{a}_{2}}}-\frac{{T}_{A}^{2}}{{n}_{A}}\hfill \\ \hfill & =\phantom{\rule{thickmathspace}{0ex}}\frac{{n}_{{a}_{2}}{T}_{{a}_{1}}^{2}+{n}_{{a}_{1}}{T}_{{a}_{2}}^{2}}{{n}_{{a}_{1}}{n}_{{a}_{2}}}-\frac{{({T}_{{a}_{1}}+{T}_{{a}_{2}})}^{2}}{{n}_{{a}_{1}}+{n}_{{a}_{2}}}\hfill \\ \hfill & =\phantom{\rule{thickmathspace}{0ex}}\frac{{({n}_{{a}_{2}}{T}_{{a}_{1}}-{n}_{{a}_{1}}{T}_{{a}_{2}})}^{2}}{{n}_{{a}_{1}}{n}_{{a}_{2}}{n}_{A}}\hfill \\ \hfill & =\phantom{\rule{thickmathspace}{0ex}}\frac{{({n}_{A}{T}_{{a}_{1}}-{n}_{{a}_{1}}{T}_{A})}^{2}}{{n}_{{a}_{1}}({n}_{A}-{n}_{{a}_{1}}){n}_{A}}.\hfill \end{array}$$

Similarly, we have

$$\Delta B=\frac{{T}_{{b}_{1}}^{2}}{{n}_{{b}_{1}}}+\frac{{T}_{{b}_{2}}^{2}}{{n}_{{b}_{2}}}-\frac{{T}_{B}^{2}}{{n}_{B}}=\frac{{({n}_{B}{T}_{{b}_{1}}-{n}_{{b}_{1}}{T}_{B})}^{2}}{{n}_{{b}_{1}}({n}_{B}-{n}_{{b}_{1}}){n}_{B}}.$$

Thus, *SS _{B}*(

$$S{S}_{B}({X}_{i}{X}_{j},Y)=S{S}_{B}({X}_{i},Y)+\Delta A+\Delta B.$$

(3)

Note that if any one of {*n*_{a1}, *n*_{a2}, *n*_{A}} is 0, then Δ*A* = 0. Similarly, if any one of {*n*_{b1}, *n*_{b2}, *n _{B}*} is 0, then Δ

Next, we develop an upper bound of *SS _{B}*(

Let {*y _{m}*|

$${y}_{{A}_{1}}\le {y}_{{A}_{2}}\le \cdots \le {y}_{{A}_{{n}_{A}}}.$$

The derivative of Δ*A* with respect to *T*_{a1} is:

$$\frac{d\Delta A}{d{T}_{{a}_{1}}}=\frac{2{n}_{A}({n}_{A}{T}_{{a}_{1}}-{n}_{{a}_{1}}{T}_{A})}{{n}_{{a}_{1}}({n}_{A}-{n}_{{a}_{1}}){n}_{A}}.$$

Thus we have

$$\Delta A\phantom{\rule{thickmathspace}{0ex}}\text{monotonically}\{\begin{array}{cc}\phantom{\rule{thickmathspace}{0ex}}\text{increases}\hfill & \text{if}\phantom{\rule{thickmathspace}{0ex}}{T}_{{a}_{1}}\ge \frac{{n}_{{a}_{1}}{T}_{A}}{{n}_{A}};\hfill \\ \phantom{\rule{thickmathspace}{0ex}}\text{decreases}\hfill & \text{if}\phantom{\rule{thickmathspace}{0ex}}{T}_{{a}_{1}}\le \frac{{n}_{{a}_{1}}{T}_{A}}{{n}_{A}}.\hfill \end{array}\phantom{\}}$$

We have the range of *T*_{a1}:

$${T}_{{a}_{1}}\in [{l}_{{a}_{1}},{u}_{{a}_{1}}]=[\sum _{i=1}^{{n}_{{a}_{1}}}{y}_{{A}_{i}},\sum _{i={n}_{A}-{n}_{{a}_{1}}+1}^{{n}_{A}}{y}_{{A}_{i}}].$$

The maximum value of Δ*A* is attained when *T*_{a1} = *l*_{a1} or *T*_{a1} = *u*_{a1}, i.e.,

$$\Delta A\le \frac{\text{max}\{{({n}_{A}{l}_{{a}_{1}}-{n}_{{a}_{1}}{T}_{A})}^{2},{({n}_{A}{u}_{{a}_{1}}-{n}_{{a}_{1}}{T}_{A})}^{2}\}}{{n}_{{a}_{1}}({n}_{A}-{n}_{{a}_{1}}){n}_{A}}.$$

(4)

We use *R*_{1}(*X _{i}X_{j}Y*) to denote this upper bound.

Let {*y _{m}*|

$${y}_{{B}_{1}}\le {y}_{{B}_{2}}\le \cdots \le {y}_{{B}_{{n}_{B}}}.$$

Similarly, we can derive the bound on Δ*B*:

$$\Delta B\le \frac{\text{max}\{{({n}_{B}{l}_{{b}_{1}}-{n}_{{b}_{1}}{T}_{B})}^{2},{({n}_{B}{u}_{{b}_{1}}-{n}_{{b}_{1}}{T}_{B})}^{2}\}}{{n}_{{b}_{1}}({n}_{B}-{n}_{{b}_{1}}){n}_{B}}.$$

(5)

We use *R*_{2}(*X _{i}X_{j}Y*) to denote this upper bound. The symbols used in Inequalities (4) and (5) are summarized in Table 2.

From Equation (3), Inequalities (4) and (5), we have the overall upper bound on *SS _{B}*(

Theorem 4.1. *(Upper bound of **SS _{B}*(

$$S{S}_{B}({X}_{i}{X}_{j},Y)\le S{S}_{B}({X}_{i},Y)+{R}_{1}\left({X}_{i}{X}_{j}Y\right)+{R}_{2}\left({X}_{i}{X}_{j}Y\right).$$

Property 4.2. *The upper bound in Theorem 4.1 is tight.*

The tightness of the bound is obvious from the derivation of the upper bound, since there exists some genotype of SNP-pair (*X _{i}X_{j}*) that makes the equality hold. For the same reason, we have the following property.

Property 4.3. The upper bound in Theorem 4.1 does not exceeds the total sum of squared deviations, i.e.,

$$S{S}_{B}({X}_{i},Y)+{R}_{1}\left({X}_{i}{X}_{j}Y\right)+{R}_{2}\left({X}_{i}{X}_{j}Y\right)\le S{S}_{T}({X}_{i}{X}_{j},Y).$$

In this section, we show how our algorithm FastANOVA utilizes the upper bound in Theorem 4.1 to achieve efficient two-locus ANOVA testing. In Section 5.1, we describe the method for Problem (2) discussed in Section 3, that is, given the threshold *F _{α}*, find all SNP-pairs whose F-statistics are greater than

Given the threshold *F _{α}*, to find all SNP-pairs whose F-statistics are greater than

$$S{S}_{B}({X}_{i}{X}_{j},Y)\ge \frac{S{S}_{T}({X}_{i},Y)}{{\scriptstyle \frac{M-g}{(g-1){F}_{\alpha}}}+1}=\theta .$$

Theorem 4.1 suggests that we only need to compute the F-statistics for the SNP-pairs that satisfy:

$$S{S}_{B}({X}_{i},Y)+{R}_{1}\left({X}_{i}{X}_{j}Y\right)+{R}_{2}\left({X}_{i}{X}_{j}Y\right)\ge \theta .$$

We refer to these SNP-pairs as *candidate* SNP-pairs.

We now discuss how to apply the upper bound in Theorem 4.1 in detail. The set of all SNP-pairs is partitioned into non-overlapping groups such that the upper bound can be readily applied to each group. For every *X _{i}* (1 ≤

$$AP\left({X}_{i}\right)=\{\left({X}_{i}{X}_{j}\right)\mid i+1\le j\le N\}.$$

For all SNP-pairs in *AP*(*X _{i}*),

Note that *n*_{a1} is the number of 1's in *X _{j}* when

Property 5.1. *If there are m 1's and* (*M* − *m*) *0's in* X_{i}*, then for any* (*X _{i}X_{j}*)

To efficiently retrieve the candidates, the SNP-pairs (*X _{i}X_{j}*) in

Example 5.2. Suppose that there are 32 individuals, and the genotype of *X _{i}* consists of half 0's and half 1's. Thus for the SNP-pairs in

Note that for a SNP-pair (*X _{i}X_{j}*)

Property 5.3. *For any SNP **X _{i}*,

The proof of Property 5.3 is straightforward and omitted here. In order to find candidate SNP-pairs, we scan all entries in *Array*(*X _{i}*) to calculate their upper bounds. Since the SNP-pairs indexed by the same entry share the same (

Algorithm 1 describes the FastANOVA algorithm for finding the SNP-pairs whose F-statistics are greater than the threshold *F _{α}*. The inputs of FastANOVA include the

For multiple tests, permutation procedure is often used in genetic analysis for controlling family-wise error rate. For genome-wide association study, permutation is less commonly used because it often entails prohibitively long computation time. Our FastANOVA algorithm makes permutation procedure feasible in genome-wide association study.

Let *Y*′ = {*Y*_{1}, *Y*_{2}, ···, *Y _{K}*} be the

Property 5.4. *For every SNP **X _{i}*,

The correctness of this property relies on the fact that, for any (*X _{i}X_{j}*)

The FastANOVA algorithm for permutation test is described in Algorithm 2. The inputs include the *N* SNPs, *K* phenotype permutations, and the Type I error threshold *α*. The goal is to find the critical value *F _{α}*, which is the

In this section, we study the time and space complexities of the FastANOVA algorithm for permutation test. The complexity for a single phenotype can be analyzed in a similar way.

**Time complexity**: For each *X _{i}*, FastANOVA needs to index (

**Space complexity**: The total number of variables in the dataset, including the SNPs and the phenotype permutations, is *N* + *K*. The maximum space of the indexing structure *Array*(*X _{i}*) is $O({(\lceil \frac{M}{4}\rceil +1)}^{2}+N)$. Note that for each SNP

In this section, we present extensive experimental results on evaluating the performance of the FastANOVA algorithm. We show (1) the runtime comparison between FastANOVA and the brute-force approach under various experimental settings, (2) the punning effect of the upper bound, and (3) the relative computational cost of each component of FastANOVA. FastANOVA is implemented in C++. The experiments are performed on a 2.4 GHz PC with 1G memory running WindowsXP system.

**Dataset**: The SNP dataset used for the experiments is extracted from a set of combined SNPs from the 140k Broad/MIT mouse dataset [26] and 10k GNF [2] mouse dataset. This merged dataset has 156,525 SNPs for 71 individuals. The missing values in the dataset are imputed using NPUTE [20]. We use both real pheno-types and synthetic phenotypes in our experiments. The real phenotype data is available from the Jackson Lab [3].

We use three real phenotypes in our experiments: cardiovascular (blood pressure), metabolism (water intake), and neurosensory (acoustic startle response). Table 3 shows the statistics of the geno-type datasets corresponding to the three phenotypes. The number of SNPs in the table indicates the number of unique SNPs in each genotype dataset.

We first show the results on finding the critical value *F _{α}*, which is more time-consuming than finding the significance SNP-pairs given the critical value

We compare FastANOVA with the brute-force approach under various experimental settings. Since the brute-force approach is very time-consuming, we use a moderate number of SNPs and permutations in the default setting in order to show the performance comparisons. The default setting is as follows: The Type I error threshold *α* = 0.01. The number of permutations is 100. The number of SNP is 10,000 for the two larger datasets of metabolism and neurosensory, and 2,900 for the cardiovascular SNP dataset. These experimental settings are chosen to demonstrate the performance gain and enhanced scalability offered by FastANOVA over the brute-force implementation. FastANOVA can handle much larger SNP panels and larger number of permutation tests. The performance of FastANOVA is expected to follow the same trends presented in the remainder of this section.

Figures 3, ,4,4, and and55 show the running time comparison of FastANOVA and the brute-force approach on the three genotype phenotype datasets using different settings. The y-axis is in logarithm scale. The numbers above the runtime line of FastANOVA indicate the ratio of the runtimes of the brute-force approach over FastANOVA. We terminate the programs that have run over 72 hours without completion.

Performance comparison between FastANOVA and the brute-force approach when varying Type I error thresholds

Performance comparison between FastANOVA and the brute-force approach when varying the number of SNPs

Performance comparison between FastANOVA and the brute-force approach when varying the number of permutations

Figure 3 shows the runtime comparison when varying the Type I error thresholds. For each dataset, the runtime of the brute-force approach does not change over different Type I error thresholds. The runtime of FastANOVA decreases as the threshold decreases. FastANOVA offers 218 fold speedup when *α* = 0.05 and 293 fold speedup when *α* = 0.01 on cardiovascular dataset. We can also ob serve a similar two-orders-of-magnitude speedup in the metabolism and neurosensory datasets. This is consistent with the pruning effect of the upper bound, which will be presented later in this section. In general, the lower the Type I error threshold, the more powerful the pruning effect, hence the faster the algorithm.

Figure 4 depicts the comparison of these two approaches when the number of SNPs changes. From these figures, it is clear that FastANOVA is about two orders of magnitude faster than the brute-force approach. The brute-force approach cannot finish in 72 hours when the number of unique SNPs is greater than 26k in the metabolism dataset and greater than 24k in the neurosensory dataset. We observe that the runtime ratio tends to increase (approaching three-orders-of-magnitude speedup) as the number of SNPs increases. This indicates that the performance gain of FastANOVA is even higher for larger SNP datasets.

Figure 5 shows the runtime comparison when the number of phenotype permutations changes. The runtime of the brute-force approach is linear with respect to the number of permutations. FastANOVA - is consistently two orders of magnitude faster than the brute-force approach. The performance gap increases as the number of permutations increases.

Table 4 shows the percentage of SNP-pairs pruned under different experimental settings. Since the three datasets have different numbers of SNPs, the 1st to 5th rows in the category of “# SNPs” correspond to the settings from left to right on x-axis in each plot in Figure 4. Most SNP-pairs are pruned under all settings. Moreover, as the Type I error threshold decreases, the pruning ratio increases, which is consistent with runtime comparison shown in Figure 3. As the number of SNPs increases, the pruning ratio also increases. This is because, with more SNPs, the dynamic threshold used to prune the search space becomes higher. Hence a larger portion of SNPs are pruned. This is consistent with results shown in Figure 4. Note that from Table 4 we observe that the pruning ratio tends to remain steady when the number of permutations changes. However, we observe that the runtime ratio increases as the number of permutations increases. The reason for these two different trends will become clear after we show the results on the computational cost of each component of FastANOVA in the next subsection.

In this subsection, we study the comparison between FastANOVA and the brute-force approach in finding significant SNP-pairs given a critical value *F _{α}*. Only the original phenotype (without permutations) is used in this procedure. We examine the detailed computation cost of each component of the FastANOVA algorithm. FastANOVA has three major components: building the indexing structure

Due to space limitation, we only show the performance comparison on the metabolism dataset. Similar behaviors are also observed on the other two datasets. The default experimental setting is the same as before. Figure 6(a) and Figure 6(b) show the runtime of these three components when varying the Type I error threshold and number of SNPs in the dataset respectively. Since *F _{α}* is a function of

Sometimes the users may be interested in finding *F _{Yk}* values of all phenotype permutations. In this way, the users can get the critical value

Table 5 shows the pruning ratio of applying the upper bound to the three real phenotype datasets. The experimental setting is the same as the default setting before. As expected, the pruning ratios are slightly lower than those in Table 4, where smaller Type I error thresholds are used to prune the search space. However, the pruning ratios on all three datasets are still above 97%. Moreover, finding all *F _{Yk}* provides the advantage that we can get the

To further study the performance of FastANOVA, we generate three synthetic phenotypes whose values follow three different distributions: uniform, standard normal, and standard exponential distribution. Our purpose is to study the pruning effect of the upper bound under different phenotype distributions. The default setting of the experiments in this subsection is as follows: #individuals = 32, #SNPs=10,000, #permutations=100, *α* = 0.01. There are 60,970 unique SNPs for these 32 individuals.

Table 6 shows the pruning ratio of FastANOVA under different settings using permutation test. In this table, we also include the pruning ratio when the number of individuals varies. We observe that the pruning effects are similar to that of real phenotypes, which indicates that the upper bound pruning is effective and insensitive to different phenotype distributions.

The large number of available SNPs poses great computational challenge to the genome-wide association study. To assess the significance of the findings, permutation test is usually required. These factors make the association study a very time-consuming process. Thus tools that can improve the efficiency of the association study are in demand.

In this paper we present an efficient algorithm, FastANOVA, for genome-wide two-locus ANOVA test. FastANOVA is a complete algorithm which guarantees to find the optimal solution. Experimental results demonstrate that FastANOVA is two to three orders of magnitude faster than the brute-force alternative. The efficiency of FastANOVA is gained from two sources. First, it utilizes an upper bound of the two-locus ANOVA test value to prune a majority of the SNP-pairs. Second, it identifies and reuses computation units that are independent of the phenotype and hence are invariant in permutation test. By eliminating redundant computation of these invariant units, FastANOVA is much more efficient than the brute-force method.

Even though FastANOVA is designed for two-locus association study of binary SNPs, the principles used in FastANOVA are general and applicable to the association study on SNP subsets containing more than two SNPs, and the heterozygous case where SNPs are encoded as {0, 1, 2}. In our future work, we will investigate how to apply these principles for association study considering joint effects of more than two SNPs and the heterozygous case.

This research was partially supported by EPA grant STAR-RD832720, NSF grant IIS-0448392, and a Micosoft New Faculty Fellowship.

**Categories and Subject Descriptors:** H.2.8 [Database Applications]: Data Mining; J.3 [Life and Medical Sciences]: Biology and Genetics

**General Terms:** Algorithm, Performance

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

4. Balding DJ. A tutorial on statistical methods for population association studies. Nature Reviews Genetics. 2006;7(10):781–791. [PubMed]

5. Carlborg O, Andersson L, Kinghorn B. The use of a genetic algorithm for simultaneous mapping of multiple interacting quantitative trait loci. Genetics. 2000;155(4):2003–2010. [PubMed]

6. Carlson CS, Eberle MA, Kruglyak L, Nickerson DA. Mapping complex disease loci in whole-genome association studies. Nature. 2004;429:446–452. [PubMed]

7. Doerge RW. Multifactorial genetics: Mapping and analysis of quantitative trait loci in experimental populations. Nature Reviews Genetics. 2002;3:43–52. [PubMed]

8. Evans DM, Marchini J, Morris AP, Cardon LR. Two-stage two-locus models in genome-wide association. PLoS Genet. 2006;2:e157. [PubMed]

9. Halperin E, Kimmel G, Shamir R. Tag snp selection in genotype data for maximizing snp prediction accuracy. Proc. ISMB. 2005 [PubMed]

10. Hoh J, et al. Selecting snps in two-stage analysis of disease association data: a model-free approach. Ann. Hum. Genet. 2000;64:413–417. [PubMed]

11. Hoh J, Ott J. Mathematical multi-locus approaches to localizing complex human trait genes. Nature Reviews Genetics. 2003;4:701–709. [PubMed]

12. Ideraabdullah FY, et al. Genetic and haplotype diversity among wild-derived mouse inbred strains. Genome Res. 2004;14(10a):1880–1887. [PubMed]

13. Liu H, Motoda H. Feature selection for knowledge discovery and data mining. Kluwer Academic Publishers; Boston: 1998.

14. Miller RG. Simultaneous Statistical Inference. Springer Verlag; New York: 1981.

15. Nakamichi R, et al. Detection of closely linked multiple quantitative trait loci using a genetic algorithm. Genetics. 2001;158(1):463–475. [PubMed]

16. Nelson MR, Kardia SL, Ferrell RE, Sing CF. A combinatorial partitioning method to identify multilocus genotypic partitions that predict quantitative trait variation. Genome Research. 2001;11:458–470. [PubMed]

17. Ohno Y, et al. Selective genotyping with epistasis can be utilized for a major quantitative trait locus mapping in hypertension in rats. Genetics. 2000;155:785–792. [PubMed]

18. Pagano M, Gauvreau K. Principles of Biostatistics. Duxbury Press; Pacific Grove, CA: 2000.

19. Ritchie MD, Hahn LW, Roodi N, Bailey LR, Dupont WD, Parl FF, Moore JH. Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. American Journal of Human Genetics. 2001;69:138–147. [PubMed]

20. Roberts A, McMillan L, Wang W, Parker J, Rusyn I, Threadgill D. Inferring missing genotypes in large snp panels using fast nearest-neighbor searches over sliding windows. Proc. ISMB. 2007 [PubMed]

21. Saxena R, et al. Genome-wide association analysis identifies loci for type 2 diabetes and triglyceride levels. Science. 2007;316:1331–1336. [PubMed]

22. Scuteri A, et al. Genome-wide association scan shows genetic variants in the fto gene are associated with obesity-related traits. PLoS Genet. 3(7):2007. [PubMed]

23. Sebastiani P, Lazarus R, Weiss ST, Kunkel LM, Kohane IS, Ramoni MF. Minimal haplotype tagging. Proc. Natl. Acad. Sci. USA. 2003;100(17):9900–9905. [PubMed]

24. Segrè D, DeLuna A, Church GM, Kishony R. Modular epistasis in yeast metabolism. Nat. Genet. 2005;37:77–83. [PubMed]

25. Shimomura K, et al. Genome-wide epistatic interaction analysis reveals complex genetic determinants of circadian behavior in mice. Genome Res. 2001;11(6):959–980. [PubMed]

26. Wade CM, Daly MJ. Genetic variation in laboratory mice. Nat. Genet. 2005;37:1175–1180. [PubMed]

27. Weedon MN, et al. A common variant of hmga2 is associated with adult and childhood height in the general population. Nat. Genet. 2007;39:1245–1250. [PMC free article] [PubMed]

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's national science library in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |