Hum Hered. 2009 October; 69(1): 14–27.

Published online 2009 October 2. doi: 10.1159/000243150

PMCID: PMC2880732

*Jeesun Jung, PhD, Department of Medical & Molecular Genetics, School of Medicine, Indiana University, Indianapolis, IN (USA), Tel. +1 317 274 3688, Fax +1 317 278 9217, E-Mail jeejung@iupui.edu

Received 2008 November 24; Accepted 2009 April 21.

Copyright © 2009 by S. Karger AG, Basel

In case-control studies identifying disease susceptibility loci, it has been shown that the interaction caused by multiple single nucleotide polymorphisms (SNPs) within a gene as well as by SNPs at unlinked genes plays an important role in influencing risk of a disease. A novel statistical approach is proposed to detect gene-gene interactions at the allelic level contributing to a disease trait. With a new allelic score inferred from the observed genotypes at two or more unlinked SNPs, we derive a score test from logistic regression and test for association of the allelic scores with a disease trait. Furthermore, F and likelihood ratio tests are derived from Cochran-Armitage regression. By testing for the association, the interaction can be assessed both when the SNP association can be detected as a main effect in a single-SNP approach and when it cannot. The analytical power and type I error rates over 6 two-way interaction models are investigated based on the non-centrality parameter approximation of the score test. Simulation studies demonstrate that (1) the power of the score test is asymptotically equivalent to that of the test statistics by the Cochran-Armitage method and (2) the allelic-based method provides higher power than two genotypic-based methods.

With the advance of high-throughput sequencing technologies, genetic association studies to identify susceptibility genes of common, complex human diseases have been promoted. It is known that multiple genes, along with the environment, play interactive roles in the development of complex diseases. Most ongoing genetic studies have focused on identifying the effect of a single gene; these studies often fail to identify a causal association with a disease because a disease is the consequence of a complicated network of multiple susceptibility loci, each of which is likely to have only a small effect when considered alone.

Recently, many statistical methods have been proposed to detect gene-gene interactions in a case-control study. As a machine learning and data mining approach, multifactor-dimensionality reduction (MDR) was proposed to pool multilocus genotypes into high-risk and low-risk groups and analyze the datasets through cross-validation to calculate the prediction accuracy of the tested SNPs [1]. Classical approaches such as classification and regression trees [2] and support vector machines [3] classify subjects into case and control groups by an optimal hyperplane predicted from combinations of SNPs. Similar to other statistical model-based methods, the logit regression method [4] is a generalized regression model that uses a ranking criterion to identify the best model. The most commonly used method is a multiplicative method in a standard logistic model framework, which consists of main effects and multiplicative interaction effects of SNPs.

Most currently available methods, whether machine learning approaches or model-based approaches, focus on the association of genotypic combinations of multiple SNPs with a disease trait. Jung et al. [5] proposed a novel approach to search for the interaction at the allelic level for quantitative trait loci (QTL). As stated in Jung et al. [5], there are several differences between genotypic-based methods and the allelic-based method. Genotypic-based approaches can increase the degrees of freedom of a test statistic, resulting in a concomitant loss of power when multiple unlinked markers (i.e. more than 3) are tested for interaction. In contrast, the allelic-based method can reduce the number of degrees of freedom by accounting for the allelic levels. Furthermore, the power of the genotypic-based method may be compromised by sparse or empty genotype combinations caused by multiple markers with small sample sizes or a low minor allele frequency (≤ 0.1). The allelic-based method, however, is relatively robust even with a low minor allele frequency and small sample sizes.

In this paper, we extend our previous work on detecting interaction of quantitative trait loci (QTL) to a case-control study design comparing those with and those without the disease. Using a new definition of the allelic-based gene-gene interaction, which is the nonrandom association of alleles occurring when a particular allele in one gene and a particular allele in another unlinked gene contribute to the risk of a disease trait through their interaction [6], the proposed method tests for association of the allelic scores assigned to each subject with a disease trait. We derive a score test based on the standard logistic regression model and both an F test and a likelihood ratio test (LRT) based on the Cochran-Armitage (CA) regression model. In addition, the non-centrality parameter approximations of these statistics are derived to evaluate the analytical properties of the test statistics. We perform simulation studies to demonstrate the analytical properties of the score test and CA-based tests and to compare the power of the allelic-based method with that of the genotypic-based methods. The method was applied to 7 candidate genes in the major depression disorder (MDD) case-control data provided in dbGaP [7] to show its feasibility.

The allelic-based gene-gene approach proposes to test for non-random association of the allelic combination with a disease trait. The allelic combination of each subject is scored by a conditional probability of having the particular allelic combination given the observed genotypes at the putative unlinked loci of each subject.

Let *D**_{1}, *D**_{2} be two disease loci underlying a disease trait, which are both in Hardy-Weinberg equilibrium (HWE) and are unlinked but associated with a disease through their interaction. Let *D*_{1} and *d*_{1} be two alleles at the first disease locus *D**_{1}, with frequencies *p*_{D1} and *p*_{d1} respectively. Let *D*_{2} and *d*_{2} be two alleles at the second disease locus *D**_{2}, with frequencies *p*_{D2} and *p*_{d2}. We consider two observed marker loci, each of which is in linkage disequilibrium (LD) with one of the two interacting disease loci: marker *M*_{1} is in LD with the susceptibility locus *D**_{1}, marker *M*_{2} is in LD with the other susceptibility locus *D**_{2}, and the two markers are unlinked. An observed marker *M*_{1} has two alleles *A* and *a* with frequencies *p*_{A} and *p*_{a} respectively, and *M*_{2} has two alleles *B* and *b* with frequencies *p*_{B} and *p*_{b} respectively. Let *D*_{AD1} = *P* (*AD*_{1}) – *P*_{A}*P*_{D1} be the measure of LD between *D**_{1} and *M*_{1}, and *D*_{BD2} = *P* (*BD*_{2}) – *P*_{B}*P*_{D2} be the LD measure between *D**_{2} and *M*_{2}. Denote the penetrance of a disease genotype of *D**_{1} by *f*_{kl} = *p* (*Y* = 1 | *D**_{1} (*D*_{1}*D*_{1}, *D*_{1}*d*_{1}, *d*_{1}*d*_{1})), *k*, *l* ∈ (*D*_{1}, *d*_{1}), and that of *D**_{2} by *f*_{mn} = *p* (*Y* = 1 | *D**_{2} (*D*_{2}*D*_{2}, *D*_{2}*d*_{2}, *d*_{2}*d*_{2})), *m*, *n* ∈ (*D*_{2}, *d*_{2}). Let *f*_{klmn} = *P* (*Y* = 1 | *D**_{1}, *D**_{2}) be the joint penetrance for the two disease loci *D**_{1} and *D**_{2}, with *k*, *l* ∈ (*D*_{1}, *d*_{1}) and *m*, *n* ∈ (*D*_{2}, *d*_{2}). As shown in figure 1, the penetrance of a particular allelic combination is calculated using the joint penetrance of the genotypic combinations that contain the particular alleles.
Therefore, we can derive the penetrance of the *D*_{1}*D*_{2} allelic combination as *f*_{D1D2} = *P* (*Y* = 1 | *D*_{1}*D*_{2}) = *f*_{D1D1D2D2} *p*_{D1} *p*_{D2} + *f*_{D1D1D2d2} *p*_{D1} *p*_{d2} + *f*_{D1d1D2D2} *p*_{d1} *p*_{D2} + *f*_{D1d1D2d2} *p*_{d1} *p*_{d2}. In the same way, *f*_{D1d2}, *f*_{d1D2} and *f*_{d1d2} can be calculated. For a control group, the penetrance *f* is replaced by 1 – *f*, such as 1 – *f*_{kl} = *p* (*Y* = 0 | *D**_{1} (*D*_{1}*D*_{1}, *D*_{1}*d*_{1}, *d*_{1}*d*_{1})) and 1 – *f*_{mn} = *p* (*Y* = 0 | *D**_{2} (*D*_{2}*D*_{2}, *D*_{2}*d*_{2}, *d*_{2}*d*_{2})), *k*, *l* ∈ (*D*_{1}, *d*_{1}), *m*, *n* ∈ (*D*_{2}, *d*_{2}) for the two genes.
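As a minimal sketch of this calculation, the following Python snippet averages the joint genotype penetrances over the partner allele at each locus to obtain the four allelic penetrances. The function names, the 0/1 allele coding and the example penetrance table (a 'dominant and dominant' pattern with a baseline risk of zero) are illustrative assumptions, not values from the paper. The expressions for *f*_{D1d2}, *f*_{d1D2} and *f*_{d1d2} are written by analogy with the formula given for *f*_{D1D2}.

```python
from itertools import product


def allelic_penetrance(joint, p_D1, p_D2):
    """Return {(x, y): f_xy} for allele x at locus 1 and allele y at locus 2.

    Alleles are coded 1 (risk allele D) and 0 (allele d); this coding is an
    assumption made for the sketch. joint[g1][g2] is the joint penetrance of
    a genotype carrying g1 copies of D1 and g2 copies of D2 (0, 1 or 2).
    """
    p1 = {1: p_D1, 0: 1.0 - p_D1}
    p2 = {1: p_D2, 0: 1.0 - p_D2}
    f = {}
    for x, y in product((1, 0), repeat=2):
        # Average over the partner allele k at locus 1 and m at locus 2.
        f[(x, y)] = sum(joint[x + k][y + m] * p1[k] * p2[m]
                        for k, m in product((1, 0), repeat=2))
    return f


def interaction_magnitude(f):
    """Contrast f_D1D2 - f_D1d2 - f_d1D2 + f_d1d2."""
    return f[(1, 1)] - f[(1, 0)] - f[(0, 1)] + f[(0, 0)]


# Hypothetical joint penetrance table: risk 0.5 when at least one risk
# allele is present at both loci, zero otherwise.
pen = 0.5
dom_dom = [[pen if (g1 >= 1 and g2 >= 1) else 0.0 for g2 in range(3)]
           for g1 in range(3)]
f_allelic = allelic_penetrance(dom_dom, p_D1=0.5, p_D2=0.5)
print(f_allelic, interaction_magnitude(f_allelic))
```

For a purely additive joint penetrance table the same contrast is zero, which is one way to sanity-check an implementation.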

Denote *y*_{i} = 1 if the *i*-th subject is affected by a disease (case) and *y*_{i} = 0 if the *i*-th subject is not affected by the disease (control). In the non-parametric maximum likelihood solution that allows an arbitrary covariate distribution, Anderson [8], Prentice and Pyke [9] and Scott and Wild [10] showed that fitting a standard prospective logistic regression in a case-control sampling design is equivalent to fitting a retrospective logistic regression except that an intercept in the case-control sampling needs the information of sampling fraction of cases and controls. Therefore, the standard logistic regression model is used due to the equivalence in the parameter estimates of interaction effect. The likelihood function of the standard logistic regression is

$$L\left(\alpha ,{\beta}_{AB},{\beta}_{Ab},{\beta}_{aB}\right)=\prod _{i=1}^{N}\left\{{\left[\text{P}\left({y}_{i}=1|{X}_{i,AB},{X}_{i,aB},{X}_{i,Ab},{w}_{i}\right)\right]}^{{y}_{i}}\times {\left[1-\text{P}\left({y}_{i}=1|{X}_{i,AB},{X}_{i,aB},{X}_{i,Ab},{w}_{i}\right)\right]}^{1-{y}_{i}}\right\},$$

(1)

where

$$\text{P}\left({y}_{i}=1|{X}_{i,AB},{X}_{i,aB},{X}_{i,Ab},{w}_{i}\right)=\exp\left(\alpha +{w}_{i}\gamma +{\Sigma}_{k=\left\{AB,aB,Ab\right\}}{X}_{i,k}{\beta}_{k}\right)/\left[1+\exp\left(\alpha +{w}_{i}\gamma +{\Sigma}_{k=\left\{AB,aB,Ab\right\}}{X}_{i,k}{\beta}_{k}\right)\right].$$

*X*_{i,AB} = P(*AB* | *M*_{1}, *M*_{2}) is the allelic score of the AB allelic combination given the *M*_{1} and *M*_{2} genotypes of the *i*-th subject, and β_{k}, *k* ∈ {*AB*, *Ab*, *aB*}, is the interaction effect of the *k*-th allelic combination. *w*_{i} is a covariate such as age or gender and γ is the coefficient of the covariate. The overall proportion of cases (the mean of *y*) is *R* / *N*, where *R* is the number of case subjects and *N* is the total number of subjects. As in Jung et al. [5], table 1 presents the allelic scores given the observed genotypes of two markers. The number of allelic combinations at two markers is 2^{2}, but since the probability of any one allelic combination is determined by the probabilities of the other three, the number of independent variables is 2^{2} − 1. The null hypothesis (*H*_{0} : β_{AB} = β_{Ab} = β_{aB} = 0) is that there is no association caused by the interaction; that is, the disease risks of the four allelic combinations are not significantly different. Rejecting the null hypothesis would indicate that there is an allelic interaction between the two markers, in particular that certain allelic combinations reflect different genetic characteristics between cases and controls.
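The allelic scores can be sketched in a few lines of Python. Table 1 is not reproduced in this text, so the snippet assumes that, for unlinked markers, the score of an allelic combination is the product of the per-marker allele proportions (copies of the allele divided by 2). That assumption reproduces the score values 0, 1/4, 1/2 and 1 that the paper lists for the Cochran-Armitage formulation, but the function name and the genotype coding are hypothetical.

```python
def allelic_scores(n_A, n_B):
    """Allelic scores for a subject with n_A copies of allele A at marker M1
    and n_B copies of allele B at marker M2 (each 0, 1 or 2).

    Assumption for this sketch: P(AB | M1, M2) = (n_A / 2) * (n_B / 2),
    and analogously for Ab, aB and ab.
    """
    if n_A not in (0, 1, 2) or n_B not in (0, 1, 2):
        raise ValueError("allele counts must be 0, 1 or 2")
    a, b = n_A / 2, n_B / 2
    return {"AB": a * b, "Ab": a * (1 - b), "aB": (1 - a) * b,
            "ab": (1 - a) * (1 - b)}


# Genotypes AABB, AABb and AaBb respectively.
for counts in [(2, 2), (2, 1), (1, 1)]:
    print(counts, allelic_scores(*counts))
```

Because the four scores of a subject sum to one, only three of them (here AB, Ab and aB) enter the regression as independent variables.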

Under the assumption of no covariates in (1), the intercept is a nuisance parameter. Let *U*^{τ} = (*U*_{AB}, *U*_{Ab}, *U*_{aB}) denote the score vector for the interaction effects. The score test statistic is

$$S={U}^{\tau}{V}^{-1}U,$$

(2)

where

$${U}^{\tau}={U}_{\beta}\left({\beta}_{{H}_{0}},\hat{\alpha}\right)={\left(\sum_{i}{x}_{i,AB}\left({y}_{i}-\overline{y}\right),\sum_{i}{x}_{i,Ab}\left({y}_{i}-\overline{y}\right),\sum_{i}{x}_{i,aB}\left({y}_{i}-\overline{y}\right)\right)}^{\tau}$$

and *V*^{− 1} is the submatrix of *I*^{–1} (α, β_{AB}, β_{Ab}, β_{aB}) corresponding to *U*^{τ} = (*U*_{AB}, *U*_{Ab}, *U*_{aB})^{τ}. Appendix A shows the detailed derivation of *U*^{τ} and *I* (α, β_{AB}, β_{Ab}, β_{aB}). Under the null hypothesis, the test statistic *S* asymptotically follows a central chi-square distribution with 3 degrees of freedom.
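The statistic can be illustrated with a small, dependency-free Python sketch. It uses the standard form of the score test for a logistic model with a nuisance intercept, in which the relevant block of the inverse information reduces to V = ȳ(1 − ȳ) Σ (x_i − x̄)(x_i − x̄)^τ. The paper's own derivation is in Appendix A, which is not reproduced here, so this form, the helper names and the genotype counts below are assumptions made for illustration.

```python
import math


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= factor * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def chi2_sf_3df(x):
    """Upper tail probability of a central chi-square with 3 df (closed form)."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


def score_test(X, y):
    """Score statistic S = U' V^{-1} U and its 3-df p value.

    X: list of rows [X_AB, X_Ab, X_aB]; y: list of 0/1 disease indicators.
    """
    n, q = len(y), len(X[0])
    ybar = sum(y) / n
    xbar = [sum(row[j] for row in X) / n for j in range(q)]
    U = [sum((row[j] - xbar[j]) * (yi - ybar) for row, yi in zip(X, y))
         for j in range(q)]
    V = [[ybar * (1 - ybar) * sum((row[j] - xbar[j]) * (row[k] - xbar[k]) for row in X)
          for k in range(q)] for j in range(q)]
    S = max(sum(u * v for u, v in zip(U, solve(V, U))), 0.0)
    return S, chi2_sf_3df(S)


def scores(n_A, n_B):
    # Assumed product form of the allelic scores (see lead-in).
    a, b = n_A / 2, n_B / 2
    return [a * b, a * (1 - b), (1 - a) * b]


def build(case_counts, control_counts):
    genotypes = [(nA, nB) for nA in range(3) for nB in range(3)]
    X, y = [], []
    for g, r, s in zip(genotypes, case_counts, control_counts):
        X += [scores(*g)] * (r + s)
        y += [1] * r + [0] * s
    return X, y


# Hypothetical genotype counts, cases enriched for alleles A and B.
X, y = build([1, 2, 2, 2, 4, 5, 2, 5, 7], [4, 4, 3, 4, 5, 3, 3, 3, 1])
S, p_value = score_test(X, y)
print(S, p_value)
```

When cases and controls have identical genotype counts, the score vector vanishes and the statistic is zero, as expected under the null hypothesis.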

To investigate the analytical power of the score test, we need to approximate a distribution of *S* under the alternative hypothesis [11]. The approximation to *U* under the alternative hypothesis is

$${U}^{*}={U}_{\beta}\left({\beta}_{{H}_{0}},{\alpha}^{*}\right)-{I}_{\beta \alpha}^{*}{I}_{\alpha \alpha}^{*-1}{U}_{\alpha}\left({\beta}_{{H}_{0}},{\alpha}^{*}\right)={\Sigma}_{i}\left({y}_{i}-\overline{y}\right)\left[\left(\begin{array}{c}{X}_{i,AB}\\ {X}_{i,Ab}\\ {X}_{i,aB}\end{array}\right)-\left(\begin{array}{c}{\overline{x}}_{AB}\\ {\overline{x}}_{Ab}\\ {\overline{x}}_{aB}\end{array}\right)\right],$$

(3)

where *U*_{α} (β_{H0}, α*) = ∂ log *L*/∂α, *I**_{βα} and *I**_{αα} are submatrices of the Fisher information matrix *I* (see Appendix A), and α* is a solution of the equation 0 = lim *n*^{−1}*E* [*U*_{α} (β_{H0},α)]. Let Ω and Σ denote the mean and covariance matrix of *U** [11]. The distribution of *S** = *U*^{*τ} Σ^{−1}*U** is approximately a chi-square distribution on 3 degrees of freedom with non-centrality parameter,

$${\lambda}_{score}={\Omega}^{\tau}{\Sigma}^{-1}\Omega ,$$

where

$$\Omega =E\left({U}^{*}\right)=\left[\left({f}_{{D}_{1}{D}_{2}}-{f}_{{D}_{1}{d}_{2}}-{f}_{{d}_{1}{D}_{2}}+{f}_{{d}_{1}{d}_{2}}\right)\left(\begin{array}{c}{D}_{A{D}_{1}}{D}_{B{D}_{2}}\\ {D}_{A{D}_{1}}{D}_{b{D}_{2}}\\ {D}_{a{D}_{1}}{D}_{B{D}_{2}}\end{array}\right)+\left({f}_{{D}_{1}}-{f}_{{d}_{1}}\right)\left(\begin{array}{c}{D}_{A{D}_{1}}{P}_{B}\\ {D}_{A{D}_{1}}{P}_{b}\\ {D}_{a{D}_{1}}{P}_{B}\end{array}\right)+\left({f}_{{D}_{2}}-{f}_{{d}_{2}}\right)\left(\begin{array}{c}{D}_{B{D}_{2}}{P}_{A}\\ {D}_{b{D}_{2}}{P}_{A}\\ {D}_{B{D}_{2}}{P}_{a}\end{array}\right)\right],$$

$$\begin{array}{l}\Sigma =V\left({U}^{*}\right)=\\ N\Phi \left(1-\Phi \right)\left[\left(\begin{array}{lll}{P}_{A}{P}_{B}\left(1-0.5{P}_{a}\right)\left(1-0.5{P}_{b}\right)\hfill & {P}_{A}{P}_{B}{P}_{b}\left(1-0.5{P}_{a}\right)\hfill & {P}_{A}{P}_{B}{P}_{a}\left(1-0.5{P}_{b}\right)\hfill \\ {P}_{A}{P}_{B}{P}_{b}\left(1-0.5{P}_{a}\right)\hfill & {P}_{A}{P}_{b}\left(1-0.5{P}_{a}\right)\left(1-0.5{P}_{B}\right)\hfill & {P}_{A}{P}_{B}{P}_{a}{P}_{b}\hfill \\ {P}_{A}{P}_{B}{P}_{a}\left(1-0.5{P}_{b}\right)\hfill & {P}_{A}{P}_{B}{P}_{a}{P}_{b}\hfill & {P}_{a}{P}_{B}\left(1-0.5{P}_{A}\right)\left(1-0.5{P}_{b}\right)\hfill \end{array}\right)-\Phi \left(1-\Phi \right){P}_{X}^{t}{P}_{X}\right],\end{array}$$

where Φ = *R* / *N*. Furthermore, (*f*_{D1} – *f*_{d1}) and (*f*_{D2} – *f*_{d2}) can be interpreted as the average effects of the gene substitution at *D**_{1} and *D**_{2}, respectively [11], and (*f*_{D1D2} – *f*_{D1d2} – *f*_{d1D2} + *f*_{d1d2}) is the magnitude of the interaction effect. The non-centrality parameter is also a function of the LD between a disease locus and a marker. Appendix B shows the detailed derivation of Ω and Σ. We can further investigate which allelic combinations are interacting to influence a disease trait by testing the allele-specific interaction effects under the null hypothesis *H*_{0} : β_{C} = 0 in the logistic model with the allelic combination of interest, logit[P(*y*_{i} = 1 | *w*_{i}, *X*_{i,C})] = α + *w*_{i}γ + *X*_{i,C} β_{C}. In the extension to multiple markers, the assignment of allelic scores to each subject and the logistic regression modeling are straightforward.

With the same allelic scores at two markers in table 1, we can model a linear trend of the proportions of cases to total (sum of cases and controls) at each allelic combination, *p*_{k,j} = *r*_{k,j} / *n*_{k, j}, where *n*_{k,j} = *r*_{k,j} + *s*_{k, j} for *k* ∈ (*AB*, *Ab*, *aB*, *ab*), *j* ∈ (0,1/4,1/2,1). *r*_{k,j} and *s*_{k,j} are the numbers of affected and unaffected subjects, respectively, having score *j* at allelic combination *k*. Table 2 shows the distribution of the allelic combination AB, *Z*_{AB,j}, with the order (*Z*_{AB,0} < *Z*_{AB,1/4} < … < *Z*_{AB,1}) in case-control data; the remaining combinations *Z*_{Ab,j}, *Z*_{aB,j} can be expressed in the same way. It has been shown that regressing *p*_{k,j} on *Z*_{AB,j}, *Z*_{Ab,j}, *Z*_{aB,j} is equivalent to regressing *y*_{i} on *Z*_{AB,j}, *Z*_{Ab,j}, *Z*_{aB,j} [13, 14].

As an extension of CA method, the interaction effect of two markers can be modeled as

$${y}_{i}=\alpha +{w}_{i}\gamma +{Z}_{AB,i}{\beta}_{AB}+{Z}_{Ab,i}{\beta}_{Ab}+{Z}_{aB,i}{\beta}_{aB}+\epsilon ,$$

(4)

where *w*_{i} is a covariate such as age or gender and γ is the coefficient of covariate. Under the assumption of no covariates for simplicity, the theoretical regression coefficients can be derived by functions of the genetic effects and of linkage disequilibrium (LD) between a marker and a disease susceptibility locus as follows.

$$\beta =\left(\begin{array}{c}\alpha \\ {\beta}_{AB}\\ {\beta}_{Ab}\\ {\beta}_{aB}\end{array}\right)={K}^{-1}\left[\frac{R}{N}\left(\begin{array}{c}1\\ {P}_{A}{P}_{B}\\ {P}_{A}{P}_{b}\\ {P}_{a}{P}_{B}\end{array}\right)+\left({f}_{{D}_{1}{D}_{2}}-{f}_{{D}_{1}{d}_{2}}-{f}_{{d}_{1}{D}_{2}}+{f}_{{d}_{1}{d}_{2}}\right)\left(\begin{array}{c}0\\ {D}_{A{D}_{1}}{D}_{B{D}_{2}}\\ {D}_{A{D}_{1}}{D}_{b{D}_{2}}\\ {D}_{a{D}_{1}}{D}_{B{D}_{2}}\end{array}\right)+\left({f}_{{D}_{1}}-{f}_{{d}_{1}}\right)\left(\begin{array}{c}0\\ {D}_{A{D}_{1}}{P}_{B}\\ {D}_{A{D}_{1}}{P}_{b}\\ {D}_{a{D}_{1}}{P}_{B}\end{array}\right)+\left({f}_{{D}_{2}}-{f}_{{d}_{2}}\right)\left(\begin{array}{c}0\\ {D}_{B{D}_{2}}{P}_{A}\\ {D}_{b{D}_{2}}{P}_{A}\\ {D}_{B{D}_{2}}{P}_{a}\end{array}\right)\right],$$

where

$$K=\left(\begin{array}{llll}1\hfill & {P}_{A}{P}_{B}\hfill & {P}_{A}{P}_{b}\hfill & {P}_{a}{P}_{B}\hfill \\ {P}_{A}{P}_{B}\hfill & {P}_{A}{P}_{B}\left(1-0.5{P}_{a}\right)\left(1-0.5{P}_{b}\right)\hfill & 0.5{P}_{A}{P}_{B}{P}_{b}\left(1-0.5{P}_{a}\right)\hfill & 0.5{P}_{A}{P}_{B}{P}_{a}\left(1-0.5{P}_{b}\right)\hfill \\ {P}_{A}{P}_{b}\hfill & 0.5{P}_{A}{P}_{B}{P}_{b}\left(1-0.5{P}_{a}\right)\hfill & {P}_{A}{P}_{b}\left(1-0.5{P}_{a}\right)\left(1-0.5{P}_{B}\right)\hfill & 0.25{P}_{A}{P}_{B}{P}_{a}{P}_{b}\hfill \\ {P}_{a}{P}_{B}\hfill & 0.5{P}_{A}{P}_{B}{P}_{a}\left(1-0.5{P}_{b}\right)\hfill & 0.25{P}_{A}{P}_{B}{P}_{a}{P}_{b}\hfill & {P}_{a}{P}_{B}\left(1-0.5{P}_{A}\right)\left(1-0.5{P}_{b}\right)\hfill \end{array}\right).$$
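The entries of *K* can be checked numerically. The sketch below assumes that *K* is the matrix of second moments of the design row (1, *X*_{AB}, *X*_{Ab}, *X*_{aB}) under HWE with independent markers, and that the allelic scores take the product form (allele count divided by 2 at each marker). Both are assumptions of this illustration, since Appendix C and table 1 are not reproduced here. Under those assumptions, exact enumeration over the nine genotype combinations agrees with the closed-form entries above.

```python
from itertools import product


def hwe(p):
    """Genotype probabilities for 0, 1 and 2 copies of an allele with frequency p."""
    q = 1.0 - p
    return {0: q * q, 1: 2 * p * q, 2: p * p}


def second_moments(pA, pB):
    """E[x x^t] for x = (1, X_AB, X_Ab, X_aB), by exact enumeration."""
    K = [[0.0] * 4 for _ in range(4)]
    for (nA, wA), (nB, wB) in product(hwe(pA).items(), hwe(pB).items()):
        a, b = nA / 2, nB / 2
        row = (1.0, a * b, a * (1 - b), (1 - a) * b)
        for i in range(4):
            for j in range(4):
                K[i][j] += wA * wB * row[i] * row[j]
    return K


def closed_form_K(pA, pB):
    """The matrix K as printed in the text."""
    pa, pb = 1 - pA, 1 - pB
    hA, ha = 1 - 0.5 * pa, 1 - 0.5 * pA  # (1 - 0.5 P_a), (1 - 0.5 P_A)
    hB, hb = 1 - 0.5 * pb, 1 - 0.5 * pB  # (1 - 0.5 P_b), (1 - 0.5 P_B)
    return [
        [1.0, pA * pB, pA * pb, pa * pB],
        [pA * pB, pA * pB * hA * hB, 0.5 * pA * pB * pb * hA, 0.5 * pA * pB * pa * hB],
        [pA * pb, 0.5 * pA * pB * pb * hA, pA * pb * hA * hb, 0.25 * pA * pB * pa * pb],
        [pa * pB, 0.5 * pA * pB * pa * hB, 0.25 * pA * pB * pa * pb, pa * pB * ha * hB],
    ]


def max_abs_diff(pA, pB):
    E, K = second_moments(pA, pB), closed_form_K(pA, pB)
    return max(abs(E[i][j] - K[i][j]) for i in range(4) for j in range(4))


print(max_abs_diff(0.3, 0.3), max_abs_diff(0.5, 0.05))
```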

Likewise, (*f*_{D1} – *f*_{d1}) and (*f*_{D2} – *f*_{d2}) are the average effects of the gene substitution at *D**_{1} and *D**_{2}, respectively, and (*f*_{D1D2} – *f*_{D1d2} – *f*_{d1D2} + *f*_{d1d2}) is the magnitude of the interaction effect. Appendix C shows the detailed derivation of the theoretical regression coefficients of model (4). The global test statistic for the interaction effect is

$${F}_{CA}=\frac{{\left(H\hat{\beta}\right)}^{t}{\left[H{\left({X}^{t}X\right)}^{-1}{H}^{t}\right]}^{-1}\left(H\hat{\beta}\right)}{{Y}^{t}\left[{I}_{N}-X{\left({X}^{t}X\right)}^{-1}{X}^{t}\right]Y}\frac{N-4}{3},$$

where *I*_{N} is the *N* × *N* identity matrix and a test matrix is

$$H=\left(\begin{array}{llll}0\hfill & 1\hfill & 0\hfill & 0\hfill \\ 0\hfill & 0\hfill & 1\hfill & 0\hfill \\ 0\hfill & 0\hfill & 0\hfill & 1\hfill \end{array}\right).$$

Under the null hypothesis *H*_{0} : β_{AB} = β_{Ab} = β_{aB} = 0, F_{CA} follows a central *F* (3, *N* − 4) distribution (non-centrality parameter λ_{CA} = 0). Under the alternative hypothesis, F_{CA} follows a noncentral *F*(3, *N* – 4) distribution with non-centrality parameter

$${\lambda}_{CA}\approx \frac{R+S}{{\sigma}^{2}}\left({\beta}_{AB},{\beta}_{Ab},{\beta}_{aB}\right){\left[HE{\left({X}^{t}X\right)}^{-1}{H}^{t}\right]}^{-1}{\left({\beta}_{AB},{\beta}_{Ab},{\beta}_{aB}\right)}^{t},$$

where σ^{2} is the total residual variance and *N* = *R* + *S*. Furthermore, an LRT can be derived for the global interaction test of *H*_{0} : β_{AB} = β_{aB} = β_{Ab} = 0, which is 2(*L*_{HA} – *L*_{H0}) following χ^{2}_{df = 3}. *L*_{HA} is the log-likelihood evaluated at the maximum likelihood estimates (MLE) $\hat{\alpha},{\hat{\beta}}_{AB},{\hat{\beta}}_{Ab},{\hat{\beta}}_{aB}$ under the alternative hypothesis and *L*_{H0} is the log-likelihood evaluated at the MLE of $\hat{\alpha}$ under the null hypothesis. Likewise, a specific allelic interaction effect under the null hypothesis *H*_{0} : β_{C} = 0 for *y*_{i} = α + *w*_{i} γ + *X*_{i,C}β_{C} can be tested in the same framework.

Because a significant allelic interaction may result from the additive effect of one or both SNPs, the allelic-based model needs to be compared with a main effects model in order to distinguish a real allelic interaction from the main effects. First, we compare the two-way interaction model with a main effect model consisting only of the main effects of the two putative SNPs. When a logistic regression model is utilized, the criteria for the model comparison are the smallest Akaike Information Criterion (AIC) and the smallest Bayesian Information Criterion (BIC), because the interaction model and the main effect model are not nested. When CA regression is utilized, an artificial nesting approach called a *J* test [15] can be employed as follows:

$${y}_{i}=\left(1-\alpha \right)\cdot f\left(X,\beta \right)+\alpha \cdot g\left(Z,\hat{\delta}\right),$$

where $f\left(X,\beta \right)=\mu +{X}_{i,AB}{\beta}_{AB}+{X}_{i,Ab}{\beta}_{Ab}+{X}_{i,aB}{\beta}_{aB}$ is a two-way interaction model and *g*(*Z*,δ) = μ + δ_{A}*Z*_{M1} + δ_{B}*Z*_{M2} is a main effect model. Since *f* is linear in β, the comparison requires that one estimates $\hat{\delta}$ from the main effect model, then fits the linear regression above and tests for α = 0 using the ordinary *t*-statistic [15]. In addition to the test of α, AIC and BIC are also used as criteria to select the best model. The same criteria are applied to the three-way interaction model comparison.
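For reference, the two information criteria have simple closed forms, AIC = 2k − 2 log L and BIC = k log n − 2 log L, where k is the number of estimated parameters and n the sample size. The sketch below applies them to a main effect model (k = 3: intercept and two SNP effects) and a two-way interaction model (k = 4: intercept and three allelic-score effects), both without covariates. The log-likelihood values and the sample size are hypothetical numbers chosen only to show the comparison.

```python
import math


def aic(loglik, k):
    """Akaike Information Criterion."""
    return 2 * k - 2 * loglik


def bic(loglik, k, n):
    """Bayesian Information Criterion."""
    return k * math.log(n) - 2 * loglik


# Hypothetical maximized log-likelihoods for n = 400 subjects.
n = 400
models = {
    "main effect": {"loglik": -270.0, "k": 3},
    "two-way interaction": {"loglik": -262.0, "k": 4},
}
for name, m in models.items():
    print(name, aic(m["loglik"], m["k"]), bic(m["loglik"], m["k"], n))

# The model with the smaller criterion value is preferred.
best_by_aic = min(models, key=lambda name: aic(models[name]["loglik"], models[name]["k"]))
print(best_by_aic)
```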

To evaluate the statistical robustness of the score test, the F test and the LRT for the allelic-based method, simulation studies were carried out to estimate type I error rates at the 1% and 5% significance levels. Six models of interaction between two unlinked disease loci were considered (see table 3). Most of the models were designed based on combinations of dominant and recessive inheritance at the genotypic level at each marker. These models are (1) Dominant or Recessive (Dom ∪ Rec); (2) Recessive or Recessive (Rec ∪ Rec); (3) Dominant and Dominant (Dom ∩ Dom); (4) Dominant and Recessive (Dom ∩ Rec); (5) Threshold, in which disease risk is increased when two or more high-risk alleles from either locus are present; and (6) Modified, in which homozygosity at either locus confers disease risk. For each model, we simulated 5,000 datasets using SNaP [16]. Each dataset has 200 cases and 200 controls with two unlinked genes under no LD between markers and disease loci (*D*′_{AD1} = *D*_{AD1} / *D*_{max} = 0; *D*′_{BD2} = *D*_{BD2} / *D*_{max} = 0; *R*^{2}_{AD1} = *R*^{2}_{BD2} = 0), disease risk allele frequencies at the two loci of 0.2 (*P*_{D1} = *P*_{D2} = 0.2), and minor allele frequencies at M_{1} and M_{2} of 0.3 (*P*_{A} = *P*_{B} = 0.3). Table 4 shows the empirical type I error rates of the score test, the F test and the LRT at the 1% and 5% significance levels. All type I error rates of the three statistics under each model were close to the nominal values of 1% and 5%, which suggests that the proposed method is statistically robust. For different parameters, such as disease-associated allele frequencies of 0.5 (*P*_{D1} = *P*_{D2} = 0.5) and equal allele frequencies at the two markers (*P*_{A} = *P*_{B} = 0.5), the type I error rates of all models were also close to the nominal values of 1% and 5%, as in table 4 (results not shown).

The analytical power and the required sample size of the proposed approach were calculated based on the non-centrality parameter λ_{score} = Ω^{τ}Σ^{−1}Ω of the score test. The power to detect the interaction effect is influenced by various parameters such as the LD between the putative disease loci and a marker (*D*_{AD1}, *D*_{BD2}), the disease allele frequencies (*p*_{D1}, *p*_{D2}), the observed marker allele frequencies (*p*_{A}, *p*_{B}), and the penetrance (*f*), which is the proportion of individuals carrying a genotypic combination of two putative disease loci that also develop the disease. For the six two-locus interaction models, the power is calculated against each parameter in turn while the remaining parameters are fixed at values that provide the greatest variation across the interaction models.
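Given a value of the non-centrality parameter, the power of a 3-df chi-square test is the upper tail probability of a noncentral chi-square distribution at the central critical value. The following dependency-free Python sketch evaluates it through the standard Poisson-mixture representation of the noncentral chi-square. The function names and the λ values are illustrative assumptions; they are not the values of λ_{score} from the paper, which depend on Ω and Σ.

```python
import math


def chi2_sf(x, df):
    """Upper tail of a central chi-square with odd df (1, 3, 5, ...)."""
    if df < 1 or df % 2 == 0:
        raise ValueError("this sketch only supports odd degrees of freedom")
    if x <= 0:
        return 1.0
    q = math.erfc(math.sqrt(x / 2))  # df = 1
    k = 1
    while k < df:
        # Q(x; k + 2) = Q(x; k) + (x/2)^(k/2) exp(-x/2) / Gamma(k/2 + 1)
        q += math.exp((k / 2) * math.log(x / 2) - x / 2 - math.lgamma(k / 2 + 1))
        k += 2
    return q


def chi2_critical(alpha, df):
    """Critical value c with P(chi2_df > c) = alpha, found by bisection."""
    lo, hi = 0.0, 200.0
    for _ in range(200):
        mid = (lo + hi) / 2
        if chi2_sf(mid, df) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def power_from_noncentrality(lam, alpha=0.01, df=3, terms=400):
    """P(noncentral chi2_df(lam) > critical value) as a Poisson mixture."""
    crit = chi2_critical(alpha, df)
    if lam == 0:
        return chi2_sf(crit, df)
    half = lam / 2
    q = chi2_sf(crit, df)  # tail probability with df + 2j degrees of freedom
    k = df
    power = 0.0
    for j in range(terms):
        weight = math.exp(-half + j * math.log(half) - math.lgamma(j + 1))
        power += weight * min(q, 1.0)
        q += math.exp((k / 2) * math.log(crit / 2) - crit / 2 - math.lgamma(k / 2 + 1))
        k += 2
    return power


for lam in (0.0, 5.0, 15.0, 30.0):  # hypothetical non-centrality values
    print(lam, power_from_noncentrality(lam))
```

Because the non-centrality parameter of the score test scales with the sample size, the same function can be inverted numerically to find the sample size that reaches a target power such as 80%.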

Figure 2 shows the power curve of the score test *S* plotted against penetrance *f* (see table 3) at the 1% significance level. Power is calculated for a sample of 1,000 cases and 1,000 controls with fixed values *D*_{AD1} = *D*_{BD2} = 0.05, *P*_{A} = *P*_{a} = 0.5 and *P*_{B} = *P*_{b} = 0.5. Models ‘Dom ∩ Dom’ and ‘Threshold’ achieved the same, highest power and ‘Modified’ obtained the lowest power across the full range of penetrance *f*. The power of ‘Rec ∪ Rec’ is higher than that of ‘Dom ∩ Rec’, which in turn is higher than that of the ‘Dom ∪ Rec’ model. Figure 3 presents the power against the A allele frequency (*p*_{A}) of marker *M*_{1} when *p*_{B} = 0.5 at marker *M*_{2}. All parameters used in figure 3 are the same as those used in figure 2 except *D*_{AD1} = {min(*P*_{D1}, *P*_{A}) – *P*_{D1}*P*_{A}}/2 and the penetrance *f* = 0.5 in each model. The pattern of the power in figure 3 is very similar to that in figure 2 across the six models, except that ‘Dom ∪ Rec’ achieves the second highest power when *p*_{A} > 0.2, whereas ‘Rec ∪ Rec’ has the second highest power across the full range of penetrance in figure 2.

Analytical power against penetrance *f* at 1% significance level. *P*_{D1} = 0.5; *P*_{D2} = 0.5; *P*_{A} = 0.5; *P*_{B} = 0.5; *D*_{AD1} = *D*_{BD2} = 0.05; the sample size of cases *R* = 1,000 and controls *S* = 1,000. Dominant or Recessive model is indicated by ‘Dom **...**

Analytical power against frequency of allele A at marker M_{1} when *P*_{B} = 0.5 at 1% significance level. *P*_{D1} = 0.5; *P*_{D2} = 0.5; *P*_{B} = 0.5; *D*_{AD1} = {min(*P*_{D1}, *P*_{A}) – *P*_{D1}*P*_{A}}/2; *D*_{BD2} = 0.05; penetrance *f* = 0.5; the sample size of cases *R* = 1,000 and controls **...**

Figure 4 illustrates the power against the linkage disequilibrium coefficient *D*_{AD1} with a fixed value of *D*_{BD2} = 0.05. The power is calculated under *f* = 0.5, *P*_{A} = *P*_{a} = 0.5, *P*_{B} = *P*_{b} = 0.5 and a sample size of 1,000 cases and 1,000 controls. The powers of ‘Dom ∩ Dom’ and ‘Threshold’ were identical, and the power curves over the 6 models are very similar to those in figures 2 and 3. Figure 5 shows the power against the disease allele frequency *p*_{D1} at *D**_{1} when *p*_{D2} is fixed at 0.5, using the same parameters as figure 4 except *D*_{AD1} = {min(*P*_{D1}, *P*_{A}) – *P*_{D1}*P*_{A}}/2. Likewise, the power of ‘Dom ∩ Dom’ is the highest and ‘Dom ∪ Rec’ outperforms ‘Threshold’. Surprisingly, the power of the ‘Modified’ model was the third highest at *p*_{D1} < 0.32, but it leveled off at a power of approximately 0.6.

Analytical power against frequency of disease allele *D*_{1} when *P*_{D2} = 0.5 at 1% significance level. *D*_{AD1} = {min(*P*_{D1}, *P*_{A}) – *P*_{D1}*P*_{A}}/2, *D*_{BD2} = 0.05; penetrance *f* = 0.5; the sample size of cases *R* = 1,000 and controls *S* = 1,000; the notations **...**

Figure 6 shows the required sample size for cases and controls across the full range of penetrance *f* to achieve 80% power at the 1% significance level when we assume the ratio of cases to controls is 1 (equal size). The sample size is calculated under *P*_{A} = *P*_{a} = 0.5; *P*_{B} = *P*_{b} = 0.5 and *D*_{AD1} = *D*_{BD2} = 0.05. As shown in figure 6, at *f* > 0.5, all models need fewer than 1,500 case (1,500 control) subjects except the ‘Modified’, ‘Dom ∩ Dom’, and ‘Threshold’ model which need fewer than 1,000 case subjects. The ‘Modified’ model needs the largest sample size over the full range of *f*, followed by the ‘Dom ∪ Rec’ and ‘Rec ∪ Rec’ models.

In order to compare the score test derived from the logistic regression with the F test and LRT from CA regression, we performed simulation studies using SNaP with two scenarios. The first scenario was defined by common variants of SNPs with MAFs at the two loci of 0.3 (*P*_{A} = *P*_{B} = 0.3) and disease-associated allele frequencies of 0.2 (*P*_{D1} = *P*_{D2} = 0.2) for a more realistic situation. The LD between a disease locus and a marker is *D*^{′}_{AD1} = *D*^{′}_{BD2} = 0.6; *R*^{2}_{AD1} = *R*^{2}_{BD2} = 0.1. The second scenario was defined by rare variants of SNPs with a minor allele frequency at the two loci of 0.05 (*P*_{A} = *P*_{B} = 0.05) and disease-associated allele frequencies of 0.5 (*P*_{D1} = *P*_{D2} = 0.5) that are increased due to the floor effect to have a similar LD, *D*^{′}_{AD1} = *D*^{′}_{BD2} = 0.6; *R*^{2}_{AD1} = *R*^{2}_{BD2} = 0.3. Both scenarios have a penetrance *f* = 0.2. For each model 2,000 datasets were simulated. Each dataset has 200 cases and 200 controls. Table 5 shows the empirical power of the score test, F test and LRT at the 1% and 5% significance levels for the common variants; table 6 shows the results for the rare variants of SNPs. The power of the three tests in both scenarios is asymptotically equivalent across the 6 interaction models.

Power (%) at 1% and 5% significance level at *D*^{′}_{AQ1} = *D*^{′}_{BQ2} = 0.6; *R*^{2}_{AQ1} = *R*^{2}_{BQ2} = 0.1 (*P*_{A} = *P*_{B} = 0.3) from 2,000 datasets, each with 200 cases and 200 controls

We have compared the allelic-based approach with two genotypic-based approaches. One is a score test of a genotypic-based model using a genotypic score in a logistic regression framework as follows.

$$\begin{array}{l}\text{logit}\left[p\left({y}_{i}=1|{w}_{i},{Z}_{i,k=\left(AA,BB\right)},\dots ,{Z}_{i,k=\left(aa,bb\right)}\right)\right]\\ =\alpha +{w}_{i}\gamma +\sum _{k=1}^{8}{Z}_{i,k=\left(jl,mn\right)}{\beta}_{k},\end{array}$$

where *j*, *l* = (*A*, or *a*), *m*, *n* = (*B*, or *b*) and

$${Z}_{i,k=\left(jl,mn\right)}=\left\{\begin{array}{ll}1 & \text{if }{G}_{k}=\left(jl,mn\right)\\ 0 & \text{otherwise}\end{array}\right. .$$

The null hypothesis of no interaction between two unlinked loci is *H*_{0} : β_{AABB} = … = β_{aaBb} = 0. The score test can be derived from the logistic regression and is approximately equivalent to the F test and LRT from the CA regression. The other genotypic-based method we have compared is multifactor dimensionality reduction (MDR). Because MDR uses cross-validation to calculate prediction accuracy, it is not directly comparable with the score test [17]. Therefore, we used a chi-square test of homogeneity for high-risk and low-risk groups between cases and controls, and estimated empirical type I error rates from the 5,000 null datasets and the empirical power of each model from the 2,000 datasets after controlling for the type I error rates under each model in both scenarios [17]. Tables 5 and 6 present the power of the score test for the allelic-based method and the genotypic-based methods. The tables illustrate that the allelic-based method outperformed both the genotypic-based method using a genotypic score and MDR.

We applied our method to Major Depression Disorder (MDD) case and control data provided by dbGaP [6]. The 1,754 cases with the disease endpoint of diagnosis and 1,763 controls without MDD were genotyped on the Perlegen 600K platform. We selected 7 candidate genes related to the serotonergic system and to brain-derived neurotrophic factor (BDNF), which are associated with a pathway of MDD. There were 78 SNPs available from the 7 candidate genes (*SLC6A4, HTR2A, TPH1, TPH2, ITGB3, COMT* and *BDNF*). As quality control procedures, SNPs with a significant deviation from Hardy-Weinberg equilibrium (p value < 6.0 × 10^{−4}) or a minor allele frequency less than 0.01 were removed. To reduce the number of tests and thus the multiple comparison burden, 64 tag SNPs were selected using Haploview. The procedure for the proposed method to search for genes associated with MDD through their interaction consists of multiple steps. First, two genes among the 7 genes were selected, and we performed a two-way interaction analysis with an SNP from each gene. All SNPs in the two unlinked genes were tested in a pair-wise fashion. In total, 1,470 SNP pair combinations from 21 two-gene combinations were tested. At each two-way interaction, we performed a score test, an F test and an LRT. Only significant SNP pairs with p values less than 3.4 × 10^{−5} after Bonferroni correction were compared with a model consisting only of a main effect of each SNP to distinguish real allelic interactions. The criteria to select the best model between an interaction model and a main effect model are that (1) the p value of the global test is less than 3.4 × 10^{−5}, (2) the p value of the model comparison between the interaction model and the main effect model is less than 0.01, and (3) the model has the smaller Akaike Information Criterion (AIC) of the following two models.

$$\begin{array}{l}\text{Main effect}:{H}_{A1}:\text{logit}\left[p\left({y}_{i}=1|{w}_{i},{X}_{i,A},{X}_{i,B}\right)\right]=\\ \mu +{w}_{i}\gamma +{X}_{i,A}{\beta}_{A}+{X}_{i,B}{\beta}_{B}\end{array}$$

$$\begin{array}{l}\text{Two-way}:{H}_{A2}:\text{logit}\left[p\left({y}_{i}=1|{w}_{i},{X}_{i,AB},{X}_{i,Ab},{X}_{i,aB}\right)\right]=\\ \mu +{w}_{i}\gamma +{X}_{i,AB}{\beta}_{AB}+{X}_{i,Ab}{\beta}_{Ab}+{X}_{i,aB}{\beta}_{aB}\end{array}$$
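The multiple-testing arithmetic and the three selection criteria described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `prefers_interaction` and its argument names are hypothetical, and the family-wise level of 0.05 is assumed because 0.05/1,470 reproduces the reported threshold of 3.4 × 10^{−5}.

```python
from math import comb

N_GENES = 7
N_SNP_PAIRS = 1470      # SNP pairs tested, as reported in the text
FAMILY_ALPHA = 0.05     # assumed family-wise level (not stated in the text)

n_gene_pairs = comb(N_GENES, 2)                 # number of two-gene combinations
bonferroni_alpha = FAMILY_ALPHA / N_SNP_PAIRS   # per-test threshold, about 3.4e-5


def prefers_interaction(p_global, p_compare, aic_interaction, aic_main,
                        alpha_global=bonferroni_alpha, alpha_compare=0.01):
    """Return True when all three criteria favour the interaction model.

    p_global:   p value of the global test of the interaction model
    p_compare:  p value of the interaction-versus-main-effect comparison
    aic_*:      AIC of each fitted model (smaller is better)
    """
    return (p_global < alpha_global
            and p_compare < alpha_compare
            and aic_interaction < aic_main)
```

The AIC values must come from the fitted models; they are not reported in the text, so any numbers passed to the function for experimentation are placeholders.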

Second, SNP pairs found to interact at the first step were considered for three-way interaction analysis by adding an SNP from one of the 5 genes that were not in the two-way interaction model. This procedure is called a ‘forward selection’ procedure. As in the two-way interaction analysis, we carried out a comparison test of the significant three-way interaction model against a main effects model. The best three-way interaction model was selected by the same three criteria. Finally, these steps were continued until no further high-dimensional interaction models were detected.
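The forward selection loop described above can be outlined as follows. This is a minimal sketch, not the implementation used in the study: the callback `evaluate` is hypothetical and stands in for the global test, the comparison test and the AIC computation, and choosing the passing extension with the lowest AIC at each step is an assumption made for illustration.

```python
def forward_selection(start_model, candidates, evaluate):
    """Greedy forward selection over (gene, snp) tuples.

    start_model: tuple of (gene, snp) pairs already found to interact.
    candidates:  list of (gene, snp) pairs available for extension.
    evaluate:    hypothetical callback returning (passes_criteria, aic)
                 for a proposed model.
    """
    model = tuple(start_model)
    while True:
        used_genes = {gene for gene, _ in model}
        best = None
        for gene, snp in candidates:
            if gene in used_genes:
                continue  # only add SNPs from genes not yet in the model
            proposal = model + ((gene, snp),)
            passes, aic = evaluate(proposal)
            if passes and (best is None or aic < best[1]):
                best = (proposal, aic)
        if best is None:
            return model  # no higher-order model passes the criteria; stop
        model = best[0]
```

Starting from a significant two-SNP model, the loop returns the highest-order model for which some extension still satisfies the criteria, or the starting model if none does.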

Based on the 7 selected genes with 64 tagging SNPs, we identified two interacting SNPs: rs11030104 in the *BDNF* gene and rs9526240 in the *HTR2A* gene (p value = 2.3 × 10^{−5} for the score test and the LRT; p value = 8.11 × 10^{−6} for the comparison test between *H*_{A2} of the two-way interaction model and *H*_{A1} of the main effects model). Two pairs of SNPs, rs6265 and rs11030104 in *BDNF*, and rs9526240 and rs6561335 in *HTR2A*, are in LD with R^{2} = 0.89 and R^{2} = 0.8, respectively. These 4 pairwise SNP combinations also interact significantly (p values between 2.4 × 10^{−5} and 2.6 × 10^{−5}). Based on the interacting SNPs, we performed a three-way interaction analysis and detected a three-way interaction of rs11030104 in *BDNF*, rs9526240 in *HTR2A* and rs652458 in *TPH1* (p value for the interaction model is 5.0 × 10^{−5}; p value for the comparison with a main effects model is 0.002). The interaction of *BDNF* and *HTR2A* has been identified in schizophrenia [18] and in unipolar depression [19]. An association of *SLC6A4* and *HTR2A* with MDD has also been reported [20].

In this study, a novel statistical approach to identify gene-gene interaction contributing to a disease trait in a case-control sampling design was proposed. In contrast with most of the currently available methods, which detect interaction at the genotypic level, the proposed method searches for interaction at the allelic level with a new definition of interaction. Interaction occurs when the contribution to disease of a particular allele inherited in one gene depends on a particular allele inherited in other unlinked genes. With this definition, the interpretation of results may be more straightforward at the allelic level than at the genotypic level because the interacting alleles can be explained by combinations of disease-associated alleles. The score test derived from a logistic regression model is computationally efficient because it does not require maximum likelihood estimation. As shown in the simulation studies, the power of the score test is asymptotically equivalent to that of the F test and LRT derived from the CA regression. Furthermore, our approach allows adjustment for non-genetic covariates as nuisance parameters, which may be necessary in complex trait analysis.

The terms ‘gene-gene interaction’ and ‘epistasis’ are used with different meanings within different areas of genetics [6]. Recently, Phillips reviewed the conceptual definitions and their differences, and also pointed out that there are conceptual barriers to generating a more unified definition [6]. Epistasis was originally defined as a mutation (or an allele) that masks or stimulates the effects of another mutation on a phenotype. Gene-gene interaction is a broad concept that encompasses a variety of genetic phenomena. In this paper, we use gene-gene interaction in a general sense that includes epistasis. Detection of genetic interaction at the allelic level is conceptually similar to epistasis, for which the biological interpretation is straightforward.

For the high-order interaction of multiple markers, the issue of multiple comparisons remains for both independent and dependent SNPs because multiple procedures are needed to distinguish the real allelic interaction from the genetic effects. As illustrated in the application to MDD, each procedure involves testing two or three hypotheses, and the many combinations of SNPs lead to a large number of hypothesis tests. Bonferroni correction for the proposed interaction models was used to find the significant sets of SNPs at each step, depending on the number of SNPs being analyzed. For each comparison test, a significance level of 0.01 or less was used. The Bonferroni correction is known to often produce conservative results when LD is ignored. Permutation analysis is an alternative approach that generates an empirical null distribution at each step; however, it may not be computationally feasible for real data applications because of the multiple hierarchical steps needed for high-dimensional interactions. Some machine learning approaches use cross-validation to estimate prediction accuracy and permutation analysis to estimate p values, both of which are computationally expensive.

Searching for gene-gene interaction is a necessary procedure in a genome-wide association study because a disease is a complicated network of multiple genes. Effort to develop computationally efficient methods to detect the interacting SNPs at the whole genome level is underway. Future work should extend this approach to the genomewide association study. Alternatively, an incorporation of biological knowledge of gene pathway to prioritize biologically important genes may reduce the search space and increase power to detect interacting genes associated with a disease.

We would like to thank two anonymous reviewers and Peter H. Baenziger for thoughtful critiques that have resulted in a much improved manuscript. This work was supported by NIH/NHLBI R01 HL095086-01. Funding support for Major Depression: Stage 1 Genomewide Association in Population-Based Samples was provided by NIH and the genotyping of samples was provided through the Genetic Association Information Network (GAIN). The dataset(s) used for the analyses described in this manuscript were obtained from the GAIN Database found at http://view.ncbi.nlm.nih.gov/dbgap-controlled through dbGaP accession number phs000020.v1.p1. Samples and associated phenotype data for Major Depression: Stage 1 Genomewide Association in Population-Based Samples were provided by P. Sullivan.

Let *U*^{τ} = (*U*_{AB}, *U*_{Ab}, *U*_{aB})^{τ} be the score function, which is the derivative of the log-likelihood function with respect to β = (β_{AB}, β_{Ab}, β_{aB}), respectively:

$${U}^{\tau}={U}_{\beta}\left({\beta}_{{H}_{0}},\hat{\alpha}\right)={\left({U}_{AB},{U}_{Ab},{U}_{aB}\right)}^{\tau}={\left(\frac{\partial \log L}{\partial {\beta}_{AB}},\frac{\partial \log L}{\partial {\beta}_{Ab}},\frac{\partial \log L}{\partial {\beta}_{aB}}\right)}^{\tau}={\left(\sum {x}_{i,AB}\left({y}_{i}-\overline{y}\right),\sum {x}_{i,Ab}\left({y}_{i}-\overline{y}\right),\sum {x}_{i,aB}\left({y}_{i}-\overline{y}\right)\right)}^{\tau}$$

Let *I* (α, β_{AB}, β_{Ab}, β_{aB}) be the observed Fisher information matrix, which is the negative second derivative of the log-likelihood function with respect to (α, β_{AB}, β_{Ab}, β_{aB}), respectively:

$$I\left(\alpha ,{\beta}_{AB},{\beta}_{Ab},{\beta}_{aB}\right)=\overline{y}\left(1-\overline{y}\right)\left(\begin{array}{llll}n\hfill & \sum {x}_{i,AB}\hfill & \sum {x}_{i,Ab}\hfill & \sum {x}_{i,aB}\hfill \\ \sum {x}_{i,AB}\hfill & \sum {x}_{i,AB}^{2}\hfill & \sum {x}_{i,AB}{x}_{i,Ab}\hfill & \sum {x}_{i,AB}{x}_{i,aB}\hfill \\ \sum {x}_{i,Ab}\hfill & \sum {x}_{i,AB}{x}_{i,Ab}\hfill & \sum {x}_{i,Ab}^{2}\hfill & \sum {x}_{i,Ab}{x}_{i,aB}\hfill \\ \sum {x}_{i,aB}\hfill & \sum {x}_{i,AB}{x}_{i,aB}\hfill & \sum {x}_{i,Ab}{x}_{i,aB}\hfill & \sum {x}_{i,aB}^{2}\hfill \end{array}\right)=\left(\begin{array}{cc}{I}_{\alpha \alpha}& {I}_{\alpha \beta}\\ {I}_{\beta \alpha}& {I}_{\beta \beta}\end{array}\right),$$

where *I*_{αα} = −∂^{2} log *L*/∂α^{2}, *I*_{αβ} = −∂^{2} log *L*/∂α∂β, and *I*_{ββ} = −∂^{2} log *L*/∂β^{2}.
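To make the score vector and the information matrix concrete, the following Python sketch computes a score statistic of the standard form *U*^{τ}*V*^{−1}*U* with *V* = *I*_{ββ} − *I*_{βα}*I*_{αα}^{−1}*I*_{αβ} in the no-covariate case. The function names `solve` and `score_statistic` are illustrative, the small Gaussian elimination routine is included only to keep the example self-contained, and *V* is assumed to be nonsingular.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def score_statistic(x, y):
    """x: per-subject score vectors, e.g. (x_AB, x_Ab, x_aB); y: 0/1 labels."""
    n = len(y)
    k = len(x[0])
    ybar = sum(y) / n
    # Score vector: U_j = sum_i x_ij (y_i - ybar)
    u = [sum(xi[j] * (yi - ybar) for xi, yi in zip(x, y)) for j in range(k)]
    sx = [sum(xi[j] for xi in x) for j in range(k)]
    # V = I_bb - I_ba I_aa^{-1} I_ab
    #   = ybar (1 - ybar) [sum_i x_i x_i^T - (sum_i x_i)(sum_i x_i)^T / n]
    v = [[ybar * (1 - ybar)
          * (sum(xi[j] * xi[l] for xi in x) - sx[j] * sx[l] / n)
          for l in range(k)] for j in range(k)]
    z = solve(v, u)  # z = V^{-1} U
    return sum(uj * zj for uj, zj in zip(u, z))
```

With a single predictor the statistic reduces to *n* times the squared sample correlation between *x* and *y*, which gives a convenient sanity check.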

Under the assumption of no covariates for simplification, we calculate the noncentrality parameter λ_{score} = Ω^{τ}Σ^{−1}Ω, where Ω and Σ are the mean and covariance matrix of *U**, respectively. Let *X* = (*X*_{AB}, *X*_{Ab}, *X*_{aB}, *X*_{ab}), where *X*_{AB} = (*X*_{1,AB}, *X*_{2,AB}, …, *X*_{n,AB})^{τ}. Let *y*_{i} = 1 if a subject is affected by the disease and *y*_{i} = 0 otherwise. Let *f*_{klmn} = *P*(*y* = 1 | *D**_{1}, *D**_{2}) be the joint penetrance for the two disease loci *D**_{1} and *D**_{2}, where *k*, *l* = (*D*_{1}, *d*_{1}) and *m*, *n* = (*D*_{2}, *d*_{2}).

$$\begin{array}{l}\begin{array}{c}\begin{array}{l}E\left({X}_{AB}y\right)={\sum}_{klmn}{X}_{AB}\times y\times P\left(y=1,{D}_{1}^{*}={D}_{k}{D}_{l},{D}_{2}^{*}={D}_{m}{D}_{n},{M}_{1},{M}_{2}\right)\hfill \end{array}\end{array}\\ ={\sum}_{klmn}{x}_{AB}\times P\left(y=1|{D}_{1}^{*},{D}_{2}^{*},{M}_{1},{M}_{2}\right)\times P\left({D}_{1}^{*},{D}_{2}^{*},{M}_{1},{M}_{2}\right)\\ =1\times \left\{{f}_{1111}\times P\left({D}_{1}^{*}={D}_{1}{D}_{1},{D}_{2}^{*}={D}_{2}{D}_{2},{M}_{1}=AA,{M}_{2}=BB\right)+\cdots +{f}_{2222}\times P\left({d}_{1}{d}_{1},{d}_{2}{d}_{2},AA,BB\right)\right\}\\ +\frac{1}{2}\times \left\{{f}_{1111}\times P\left({D}_{1}{D}_{1},{D}_{2}{D}_{2},AA,Bb\right)+\cdots +{f}_{2222}\times P\left({d}_{1}{d}_{1},{d}_{2}{d}_{2},AA,Bb\right)\right\}\\ +\frac{1}{2}\times \left\{{f}_{1111}\times P\left({D}_{1}{D}_{1},{D}_{2}{D}_{2},Aa,BB\right)+\cdots +{f}_{2222}\times P\left({d}_{1}{d}_{1},{d}_{2}{d}_{2},Aa,BB\right)\right\}\\ +\frac{1}{4}\times \left\{{f}_{1111}\times P\left({D}_{1}{D}_{1},{D}_{2}{D}_{2},Aa,Bb\right)+\cdots +{f}_{2222}\times P\left({d}_{1}{d}_{1},{d}_{2}{d}_{2},Aa,Bb\right)\right\}\\ ={f}_{1111}\left\{p{\left(A{D}_{1}\right)}^{2}P{\left(B{D}_{2}\right)}^{2}+\frac{1}{2}\times 2p{\left(A{D}_{1}\right)}^{2}P\left(B{D}_{2}\right)P\left(b{D}_{2}\right)+\frac{1}{2}\times 2p\left(A{D}_{1}\right)p\left(a{D}_{1}\right)P{\left(B{D}_{2}\right)}^{2}+\frac{1}{4}\times 4p\left(A{D}_{1}\right)p\left(a{D}_{1}\right)P\left(B{D}_{2}\right)P\left(b{D}_{2}\right)\right\}\\ +\cdots +{f}_{2222}\left\{p{\left(A{d}_{1}\right)}^{2}P{\left(B{d}_{2}\right)}^{2}+\frac{1}{2}\times 2p{\left(A{d}_{1}\right)}^{2}P\left(B{d}_{2}\right)P\left(b{d}_{2}\right)+\frac{1}{2}\times 2p\left(A{d}_{1}\right)p\left(a{d}_{1}\right)P{\left(B{d}_{2}\right)}^{2}+\frac{1}{4}\times 4p\left(A{d}_{1}\right)p\left(a{d}_{1}\right)P\left(B{d}_{2}\right)P\left(b{d}_{2}\right)\right\}\end{array}$$

$$\begin{array}{l}E\left({X}_{AB}y\right)=P\left(A{D}_{1}\right)P\left(B{D}_{2}\right)\left\{{f}_{1111}P\left({D}_{1}\right)P\left({D}_{2}\right)+{f}_{1112}P\left({D}_{1}\right)P\left({d}_{2}\right)+{f}_{1211}P\left({d}_{1}\right)P\left({D}_{2}\right)+{f}_{1212}P\left({d}_{1}\right)P\left({d}_{2}\right)\right\}\\ +P\left(A{D}_{1}\right)P\left(B{d}_{2}\right)\left\{{f}_{1112}P\left({D}_{1}\right)P\left({D}_{2}\right)+{f}_{1122}P\left({D}_{1}\right)P\left({d}_{2}\right)+{f}_{1212}P\left({d}_{1}\right)P\left({D}_{2}\right)+{f}_{1222}P\left({d}_{1}\right)P\left({d}_{2}\right)\right\}\\ +P\left(A{d}_{1}\right)P\left(B{D}_{2}\right)\left\{{f}_{1211}P\left({D}_{1}\right)P\left({D}_{2}\right)+{f}_{1212}P\left({D}_{1}\right)P\left({d}_{2}\right)+{f}_{2211}P\left({d}_{1}\right)P\left({D}_{2}\right)+{f}_{2212}P\left({d}_{1}\right)P\left({d}_{2}\right)\right\}\\ +P\left(A{d}_{1}\right)P\left(B{d}_{2}\right)\left\{{f}_{1212}P\left({D}_{1}\right)P\left({D}_{2}\right)+{f}_{1222}P\left({D}_{1}\right)P\left({d}_{2}\right)+{f}_{2212}P\left({d}_{1}\right)P\left({D}_{2}\right)+{f}_{2222}P\left({d}_{1}\right)P\left({d}_{2}\right)\right\}\end{array}$$

Utilizing *P*(*AD*_{1}) = *D*_{AD1} + *P*_{A}*P*_{D1}; *P*(*Ad*_{1}) = *D*_{Ad1} + *P*_{A}*P*_{d1}; *P*(*BD*_{2}) = *D*_{BD2} + *P*_{B}*P*_{D2}; *P*(*Bd*_{2}) = *D*_{Bd2} + *P*_{B}*P*_{d2} and *f*_{D1D2} = *f*_{1111}*P*(*D*_{1})*P*(*D*_{2}) + *f*_{1112}*P*(*D*_{1})*P*(*d*_{2}) + *f*_{1211}*P*(*d*_{1})*P*(*D*_{2}) + *f*_{1212}*P*(*d*_{1})*P*(*d*_{2}), we derived

$$\begin{array}{l}E\left({X}_{AB}y\right)=\left({D}_{A{D}_{1}}+{P}_{A}{P}_{{D}_{1}}\right)\times \left({D}_{B{D}_{2}}+{P}_{B}{P}_{{D}_{2}}\right){f}_{{D}_{1}{D}_{2}}+\cdots +\left({D}_{A{d}_{1}}+{P}_{A}{P}_{{d}_{1}}\right)\times \left({D}_{B{d}_{2}}+{P}_{B}{P}_{{d}_{2}}\right){f}_{{d}_{1}{d}_{2}}\\ =f{P}_{A}{P}_{B}+\left({f}_{{D}_{1}{D}_{2}}-{f}_{{D}_{1}{d}_{2}}-{f}_{{d}_{1}{D}_{2}}+{f}_{{d}_{1}{d}_{2}}\right){D}_{A{D}_{1}}{D}_{B{D}_{2}}+\left({f}_{{D}_{1}}-{f}_{{d}_{1}}\right){D}_{A{D}_{1}}{P}_{B}+\left({f}_{{D}_{2}}-{f}_{{d}_{2}}\right){D}_{B{D}_{2}}{P}_{A}.\end{array}$$
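The decomposition of *E*(*X*_{AB}*y*) into a baseline term, an interaction term and two main-effect terms can be checked numerically. The Python sketch below evaluates both sides for arbitrary inputs. It rests on assumptions that are not spelled out in this appendix: *f* is taken to be the penetrance averaged over both disease loci, *f*_{D1} and *f*_{d1} (and likewise *f*_{D2}, *f*_{d2}) are taken to be penetrances averaged over the other locus, and *D*_{Ad1} = −*D*_{AD1} and *D*_{Bd2} = −*D*_{BD2}. The function name `expected_xy` is illustrative.

```python
def expected_xy(p_a, p_b, p_d1, p_d2, d_ad1, d_bd2, f):
    """Return (lhs, rhs) of the decomposition of E(X_AB y).

    f maps ('D','D'), ('D','d'), ('d','D'), ('d','d') to the haplotype-level
    penetrances f_{D1D2}, f_{D1d2}, f_{d1D2}, f_{d1d2}.
    """
    p1 = {'D': p_d1, 'd': 1 - p_d1}    # allele frequencies at disease locus 1
    p2 = {'D': p_d2, 'd': 1 - p_d2}    # allele frequencies at disease locus 2
    ld1 = {'D': d_ad1, 'd': -d_ad1}    # assumes D_{Ad1} = -D_{AD1}
    ld2 = {'D': d_bd2, 'd': -d_bd2}    # assumes D_{Bd2} = -D_{BD2}

    # Left side: sum over haplotypes of P(A h1) P(B h2) f_{h1 h2}
    lhs = sum((ld1[h1] + p_a * p1[h1]) * (ld2[h2] + p_b * p2[h2]) * f[h1, h2]
              for h1 in 'Dd' for h2 in 'Dd')

    # Averaged penetrances (assumed definitions, see lead-in)
    f_all = sum(p1[h1] * p2[h2] * f[h1, h2] for h1 in 'Dd' for h2 in 'Dd')
    f1 = {h1: sum(p2[h2] * f[h1, h2] for h2 in 'Dd') for h1 in 'Dd'}
    f2 = {h2: sum(p1[h1] * f[h1, h2] for h1 in 'Dd') for h2 in 'Dd'}

    # Right side: baseline + interaction term + two main-effect terms
    rhs = (f_all * p_a * p_b
           + (f['D', 'D'] - f['D', 'd'] - f['d', 'D'] + f['d', 'd']) * d_ad1 * d_bd2
           + (f1['D'] - f1['d']) * d_ad1 * p_b
           + (f2['D'] - f2['d']) * d_bd2 * p_a)
    return lhs, rhs
```

Under these assumptions the two returned values agree up to floating-point error for any choice of inputs, and the interaction term vanishes whenever either marker is in linkage equilibrium with its disease locus.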

*E*(*X*_{Ab}*y*) and *E*(*X*_{aB}*y*) can be calculated in the same way.

Using *E*(*X*_{AB}) = *P*_{A}*P*_{B}, *E*(*X*_{Ab}) = *P*_{A}*P*_{b}, *E*(*X*_{aB}) = *P*_{a}*P*_{B}, *E*(*X*_{ab}) = *P*_{a}*P*_{b}, *E*(*X*^{2}_{AB}) = *P*_{A}*P*_{B} (1 − 0.5 *P*_{a})(1 − 0.5 *P*_{b}), *E*(*X*_{AB}*X*_{Ab}) = 0.5 *P*_{A}*P*_{B}*P*_{b} (1 − 0.5 *P*_{a}), …, *E*(*X*_{aB}*X*_{ab}) = 1/4 *P*_{a}*P*_{B}*P*_{b} (2 − *P*_{A}),

$$E\left(\begin{array}{lll}\sum {x}_{i,AB}^{2}\hfill & \sum {x}_{i,AB}{x}_{i,Ab}\hfill & \sum {x}_{i,AB}{x}_{i,aB}\hfill \\ \sum {x}_{i,AB}{x}_{i,Ab}\hfill & \sum {x}_{i,Ab}^{2}\hfill & \sum {x}_{i,Ab}{x}_{i,aB}\hfill \\ \sum {x}_{i,AB}{x}_{i,aB}\hfill & \sum {x}_{i,Ab}{x}_{i,aB}\hfill & \sum {x}_{i,aB}^{2}\hfill \end{array}\right)=N\frac{R}{N}\left(1-\frac{R}{N}\right)\left(\begin{array}{lll}{P}_{A}{P}_{B}\left(1-0.5{P}_{a}\right)\left(1-0.5{P}_{b}\right)\hfill & 0.5{P}_{A}{P}_{B}{P}_{b}\left(1-0.5{P}_{a}\right)\hfill & 0.5{P}_{A}{P}_{B}{P}_{a}\left(1-0.5{P}_{b}\right)\hfill \\ 0.5{P}_{A}{P}_{B}{P}_{b}\left(1-0.5{P}_{a}\right)\hfill & {P}_{A}{P}_{b}\left(1-0.5{P}_{a}\right)\left(1-0.5{P}_{B}\right)\hfill & 0.25{P}_{A}{P}_{B}{P}_{a}{P}_{b}\hfill \\ 0.5{P}_{A}{P}_{B}{P}_{a}\left(1-0.5{P}_{b}\right)\hfill & 0.25{P}_{A}{P}_{B}{P}_{a}{P}_{b}\hfill & {P}_{a}{P}_{B}\left(1-0.5{P}_{A}\right)\left(1-0.5{P}_{b}\right)\hfill \end{array}\right)$$
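The first and second moments of the allelic scores can be verified by enumerating the nine two-locus genotypes. The Python sketch below assumes Hardy-Weinberg proportions at each marker, independence of the two unlinked loci, and an allelic score equal to the product of the allele proportions carried at each locus, which matches the weights 1, 1/2, 1/2 and 1/4 used for *X*_{AB} in the derivation above. The function names are illustrative.

```python
from itertools import product


def genotype_dist(p):
    """HWE distribution of the count (2, 1, 0) of an allele with frequency p."""
    q = 1 - p
    return {2: p * p, 1: 2 * p * q, 0: q * q}


def allelic_score_moments(p_a, p_b):
    """Return E(X) and E(X X^T) for X = (X_AB, X_Ab, X_aB, X_ab)."""
    names = ('AB', 'Ab', 'aB', 'ab')
    mean = dict.fromkeys(names, 0.0)
    second = {(r, c): 0.0 for r in names for c in names}
    for (n_a, prob_a), (n_b, prob_b) in product(genotype_dist(p_a).items(),
                                                genotype_dist(p_b).items()):
        prob = prob_a * prob_b                  # unlinked loci: independent genotypes
        frac_a = {'A': n_a / 2, 'a': 1 - n_a / 2}   # allele proportions at locus 1
        frac_b = {'B': n_b / 2, 'b': 1 - n_b / 2}   # allele proportions at locus 2
        x = {s: frac_a[s[0]] * frac_b[s[1]] for s in names}
        for r in names:
            mean[r] += prob * x[r]
            for c in names:
                second[r, c] += prob * x[r] * x[c]
    return mean, second
```

For example, `allelic_score_moments(0.3, 0.6)` returns a `second['AB', 'AB']` entry that can be compared against *P*_{A}*P*_{B}(1 − 0.5 *P*_{a})(1 − 0.5 *P*_{b}) evaluated at the same frequencies.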

Therefore, Ω and Σ can be expressed as follows.

$$\begin{array}{l}\Omega =E\left({U}^{*}\right)=E{\left(\left(\begin{array}{c}{\sum}_{i}{y}_{i}{x}_{i,AB}\\ {\sum}_{i}{y}_{i}{x}_{i,Ab}\\ {\sum}_{i}{y}_{i}{x}_{i,aB}\end{array}\right)-\overline{y}\left(\begin{array}{c}{\sum}_{i}{x}_{i,AB}\\ {\sum}_{i}{x}_{i,Ab}\\ {\sum}_{i}{x}_{i,aB}\end{array}\right)\right)}^{\tau}\\ =\left[\frac{R}{N}\left(\begin{array}{c}{P}_{A}{P}_{B}\\ {P}_{A}{P}_{b}\\ {P}_{a}{P}_{B}\end{array}\right)+\left({f}_{{D}_{1}{D}_{2}}-{f}_{{D}_{1}{d}_{2}}-{f}_{{d}_{1}{D}_{2}}+{f}_{{d}_{1}{d}_{2}}\right)\left(\begin{array}{c}{D}_{A{D}_{1}}{D}_{B{D}_{2}}\\ {D}_{A{D}_{1}}{D}_{b{D}_{2}}\\ {D}_{a{D}_{1}}{D}_{B{D}_{2}}\end{array}\right)+\left({f}_{{D}_{1}}-{f}_{{d}_{1}}\right)\left(\begin{array}{c}{D}_{A{D}_{1}}{P}_{B}\\ {D}_{A{D}_{1}}{P}_{b}\\ {D}_{a{D}_{1}}{P}_{B}\end{array}\right)+\left({f}_{{D}_{2}}-{f}_{{d}_{2}}\right)\left(\begin{array}{c}{D}_{B{D}_{2}}{P}_{A}\\ {D}_{b{D}_{2}}{P}_{A}\\ {D}_{B{D}_{2}}{P}_{a}\end{array}\right)-\frac{R}{N}\left(\begin{array}{c}{P}_{A}{P}_{B}\\ {P}_{A}{P}_{b}\\ {P}_{a}{P}_{B}\end{array}\right)\right]\\ =\left[\left({f}_{{D}_{1}{D}_{2}}-{f}_{{D}_{1}{d}_{2}}-{f}_{{d}_{1}{D}_{2}}+{f}_{{d}_{1}{d}_{2}}\right)\left(\begin{array}{c}{D}_{A{D}_{1}}{D}_{B{D}_{2}}\\ {D}_{A{D}_{1}}{D}_{b{D}_{2}}\\ {D}_{a{D}_{1}}{D}_{B{D}_{2}}\end{array}\right)+\left({f}_{{D}_{1}}-{f}_{{d}_{1}}\right)\left(\begin{array}{c}{D}_{A{D}_{1}}{P}_{B}\\ {D}_{A{D}_{1}}{P}_{b}\\ {D}_{a{D}_{1}}{P}_{B}\end{array}\right)+\left({f}_{{D}_{2}}-{f}_{{d}_{2}}\right)\left(\begin{array}{c}{D}_{B{D}_{2}}{P}_{A}\\ {D}_{b{D}_{2}}{P}_{A}\\ {D}_{B{D}_{2}}{P}_{a}\end{array}\right)\right]\end{array}$$

$$\begin{array}{l}\Sigma =V\left({U}^{*}\right)={I}_{\beta \beta}-{I}_{\beta \alpha}{I}_{\alpha \alpha}^{-1}{I}_{\alpha \beta}\\ =\overline{y}\left(1-\overline{y}\right)\left(\begin{array}{lll}\sum {x}_{i,AB}^{2}\hfill & \sum {x}_{i,AB}{x}_{i,Ab}\hfill & \sum {x}_{i,AB}{x}_{i,aB}\hfill \\ \sum {x}_{i,AB}{x}_{i,Ab}\hfill & \sum {x}_{i,Ab}^{2}\hfill & \sum {x}_{i,Ab}{x}_{i,aB}\hfill \\ \sum {x}_{i,AB}{x}_{i,aB}\hfill & \sum {x}_{i,Ab}{x}_{i,aB}\hfill & \sum {x}_{i,aB}^{2}\hfill \end{array}\right)-\frac{1}{N}\overline{y}\left(1-\overline{y}\right)\left(\begin{array}{c}\sum {x}_{i,AB}\\ \sum {x}_{i,Ab}\\ \sum {x}_{i,aB}\end{array}\right)\left(\begin{array}{ccc}\sum {x}_{i,AB}& \sum {x}_{i,Ab}& \sum {x}_{i,aB}\end{array}\right)\\ =N\frac{R}{N}\left(1-\frac{R}{N}\right)\left[\left(\begin{array}{lll}{P}_{A}{P}_{B}\left(1-0.5{P}_{a}\right)\left(1-0.5{P}_{b}\right)\hfill & 0.5{P}_{A}{P}_{B}{P}_{b}\left(1-0.5{P}_{a}\right)\hfill & 0.5{P}_{A}{P}_{B}{P}_{a}\left(1-0.5{P}_{b}\right)\hfill \\ 0.5{P}_{A}{P}_{B}{P}_{b}\left(1-0.5{P}_{a}\right)\hfill & {P}_{A}{P}_{b}\left(1-0.5{P}_{a}\right)\left(1-0.5{P}_{B}\right)\hfill & 0.25{P}_{A}{P}_{B}{P}_{a}{P}_{b}\hfill \\ 0.5{P}_{A}{P}_{B}{P}_{a}\left(1-0.5{P}_{b}\right)\hfill & 0.25{P}_{A}{P}_{B}{P}_{a}{P}_{b}\hfill & {P}_{a}{P}_{B}\left(1-0.5{P}_{A}\right)\left(1-0.5{P}_{b}\right)\hfill \end{array}\right)-\left(\begin{array}{c}{P}_{A}{P}_{B}\\ {P}_{A}{P}_{b}\\ {P}_{a}{P}_{B}\end{array}\right)\left(\begin{array}{ccc}{P}_{A}{P}_{B}& {P}_{A}{P}_{b}& {P}_{a}{P}_{B}\end{array}\right)\right].\end{array}$$

We multiply both sides by *X* = (*X*_{AB}, *X*_{Ab}, *X*_{aB}, *X*_{ab}), where *X*_{AB} = (*X*_{1,AB}, *X*_{2,AB}, …, *X*_{n,AB})^{τ} and take the expectation in order to derive regression coefficients in equation (3)

$$E\left(\begin{array}{c}{X}_{AB}y\\ {X}_{Ab}y\\ {X}_{aB}y\\ {X}_{ab}y\end{array}\right)=E\left(\begin{array}{llll}{X}_{AB}^{2}\hfill & {X}_{AB}{X}_{Ab}\hfill & {X}_{AB}{X}_{aB}\hfill & {X}_{AB}{X}_{ab}\hfill \\ {X}_{Ab}{X}_{AB}\hfill & {X}_{Ab}^{2}\hfill & {X}_{Ab}{X}_{aB}\hfill & {X}_{Ab}{X}_{ab}\hfill \\ {X}_{aB}{X}_{AB}\hfill & {X}_{aB}{X}_{Ab}\hfill & {X}_{aB}^{2}\hfill & {X}_{aB}{X}_{ab}\hfill \\ {X}_{ab}{X}_{AB}\hfill & {X}_{ab}{X}_{Ab}\hfill & {X}_{ab}{X}_{aB}\hfill & {X}_{ab}^{2}\hfill \end{array}\right)\times \left(\begin{array}{c}{\beta}_{AB}\\ {\beta}_{Ab}\\ {\beta}_{aB}\\ {\beta}_{ab}\end{array}\right)$$

The expectation of each element in matrix *E* (*X′ X*) is calculated with the assumption of no covariates as shown in Appendix A.

$$\begin{array}{l}E\left({X}_{AB}y\right)={\sum}_{klmn}{X}_{AB}\times y\times P\left(y=1,{D}_{1}^{*}={D}_{k}{D}_{l},{D}_{2}^{*}={D}_{m}{D}_{n},{M}_{1},{M}_{2}\right)\\ ={\sum}_{klmn}{x}_{AB}\times P\left(y=1|{D}_{1}^{*},{D}_{2}^{*},{M}_{1},{M}_{2}\right)\times P\left({D}_{1}^{*},{D}_{2}^{*},{M}_{1},{M}_{2}\right)\\ =\left({D}_{A{D}_{1}}+{P}_{A}{P}_{{D}_{1}}\right)\times \left({D}_{B{D}_{2}}+{P}_{B}{P}_{{D}_{2}}\right){f}_{{D}_{1}{D}_{2}}+\cdots +\left({D}_{A{d}_{1}}+{P}_{A}{P}_{{d}_{1}}\right)\times \left({D}_{B{d}_{2}}+{P}_{B}{P}_{{d}_{2}}\right){f}_{{d}_{1}{d}_{2}}\\ =f{P}_{A}{P}_{B}+\left({f}_{{D}_{1}{D}_{2}}-{f}_{{D}_{1}{d}_{2}}-{f}_{{d}_{1}{D}_{2}}+{f}_{{d}_{1}{d}_{2}}\right){D}_{A{D}_{1}}{D}_{B{D}_{2}}+\left({f}_{{D}_{1}}-{f}_{{d}_{1}}\right){D}_{A{D}_{1}}{P}_{B}+\left({f}_{{D}_{2}}-{f}_{{d}_{2}}\right){D}_{B{D}_{2}}{P}_{A}\end{array}$$

$$\beta =\left(\begin{array}{c}\alpha \\ {\beta}_{AB}\\ {\beta}_{Ab}\\ {\beta}_{aB}\end{array}\right)={K}^{-1}\left[\frac{R}{N}\left(\begin{array}{c}1\\ {P}_{A}{P}_{B}\\ {P}_{A}{P}_{b}\\ {P}_{a}{P}_{B}\end{array}\right)+\left({f}_{{D}_{1}{D}_{2}}-{f}_{{D}_{1}{d}_{2}}-{f}_{{d}_{1}{D}_{2}}+{f}_{{d}_{1}{d}_{2}}\right)\left(\begin{array}{c}0\\ {D}_{A{D}_{1}}{D}_{B{D}_{2}}\\ {D}_{A{D}_{1}}{D}_{b{D}_{2}}\\ {D}_{a{D}_{1}}{D}_{B{D}_{2}}\end{array}\right)+\left({f}_{{D}_{1}}-{f}_{{d}_{1}}\right)\left(\begin{array}{c}0\\ {D}_{A{D}_{1}}{P}_{B}\\ {D}_{A{D}_{1}}{P}_{b}\\ {D}_{a{D}_{1}}{P}_{B}\end{array}\right)+\left({f}_{{D}_{2}}-{f}_{{d}_{2}}\right)\left(\begin{array}{c}0\\ {D}_{B{D}_{2}}{P}_{A}\\ {D}_{b{D}_{2}}{P}_{A}\\ {D}_{B{D}_{2}}{P}_{a}\end{array}\right)\right],$$

(2)

where

$$K=\left(\begin{array}{llll}1\hfill & {P}_{A}{P}_{B}\hfill & {P}_{A}{P}_{b}\hfill & {P}_{a}{P}_{B}\hfill \\ {P}_{A}{P}_{B}\hfill & {P}_{A}{P}_{B}\left(1-0.5{P}_{a}\right)\left(1-0.5{P}_{b}\right)\hfill & 0.5{P}_{A}{P}_{B}{P}_{b}\left(1-0.5{P}_{a}\right)\hfill & 0.5{P}_{A}{P}_{B}{P}_{a}\left(1-0.5{P}_{b}\right)\hfill \\ {P}_{A}{P}_{b}\hfill & 0.5{P}_{A}{P}_{B}{P}_{b}\left(1-0.5{P}_{a}\right)\hfill & {P}_{A}{P}_{b}\left(1-0.5{P}_{a}\right)\left(1-0.5{P}_{B}\right)\hfill & 0.25{P}_{A}{P}_{B}{P}_{a}{P}_{b}\hfill \\ {P}_{a}{P}_{B}\hfill & 0.5{P}_{A}{P}_{B}{P}_{a}\left(1-0.5{P}_{b}\right)\hfill & 0.25{P}_{A}{P}_{B}{P}_{a}{P}_{b}\hfill & {P}_{a}{P}_{B}\left(1-0.5{P}_{A}\right)\left(1-0.5{P}_{b}\right)\hfill \end{array}\right).$$

1. Ritchie MD, Hahn LW, Roodi N, Bailey R, Dupont WD, Parl FF, Moore J. Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. Am J Hum Genet. 2001;69:138–147. [PubMed]

2. Cook NR, Zee RY, Ridker PM. Tree and spline based association analysis of gene × gene interaction models for ischemic stroke. Stat Med. 2004;23:1439–1453. [PubMed]

3. Chen SH, Sun J, Dimitrov L, Turner AR, Adams TS, Meyers DA, Chang BL, Zheng SL, Grönberg H, Xu J, Hsu FC. A support vector machine approach for detecting gene-gene interaction. Genet Epidemiol. 2008;32:152–167. [PubMed]

4. Kooperberg C, Ruczinski I. Identifying interacting SNPs using Monte Carlo logic regression. Genet Epidemiol. 2005;28:157–170. [PubMed]

5. Jung J, Sun B, Kwon D, Koller D, Foroud T. Allelic based gene-gene interaction association with quantitative trait loci. Genet Epidemiol. 2009; in press. [PMC free article] [PubMed]

6. Phillips PC. Epistasis – the essential role of gene interactions in the structure and evolution of genetic systems. Nat Rev Genet. 2008;9:855–867. [PMC free article] [PubMed]

7. Boomsma DI, Willemsen G, Sullivan PF, Heutink P, Meijer P, Sondervan D, Kluft C, Smit G, Nolen WA, Zitman FG, Smit JH, Hoogendijk WJ, van Dyck R, de Geus EJ, Penninx BW. Genome-wide association of major depression: description of samples for the GAIN Major Depressive Disorder Study: NTR and NESDA biobank projects. Eur J Hum Genet. 2008;16:335–342. [PubMed]

8. Anderson JA. Separate sample logistic discrimination. Biometrika. 1972;59:19–35.

9. Prentice RL, Pyke R. Logistic disease incidence models and case-control studies. Biometrika. 1979;66:403–411.

10. Scott AJ, Wild CJ. Fitting regression models to case-control data by maximum likelihood. Biometrika. 1997;84:57–71.

11. Self SG, Mauritsen RH. Power/Sample size calculations for generalized linear models. Biometrics. 1988;44:79–86.

12. Falconer DS, Mackay TFC. Introduction to quantitative genetics. ed 4. London: Longman; 1996.

13. Armitage P. Tests for linear trends in proportions and frequencies. Biometrics. 1955;11:375–386.

14. Cochran WG. Some methods for strengthening the common tests. Biometrics. 1954;10:417–451.

15. McAleer M. Exact tests of a model against nonnested alternatives. Biometrika. 1983;70:285–288.

16. Nothnagel M. Simulation of LD block-structured SNP haplotype data and its use for the analysis of case-control data by supervised learning methods. Am J Hum Genet. 2002;71(suppl):A2363.

17. Millstein J, Conti DV, Gilliland FD, Gauderman WJ. A testing framework for identifying susceptibility genes in the presence of epistasis. Am J Hum Genet. 2006;78:15–27. [PubMed]

18. Alfimova MV, Lezheĭko TV, Golimbet VE, Korovaĭtseva GI, Lavrushkina OM, Kolesina NIU, Frolova LP, Muratova AA, Abramova LI, Kaleda VG. Investigation of association of the brain-derived neurotrophic factor (BDNF) and a serotonin receptor 2A (5-HTR2A) genes with voluntary and involuntary attention in schizophrenia. Zh Nevrol Psikhiatr Im S S Korsakova. 2008;108:62–69. [PubMed]

19. Kotte A, McQuaid JR, Kelsoe JR. Psychotherapeutic mechanisms of change: the role of genes in depression treatment outcome. Abstract. Am Soc Hum Genet. 2007.

20. Lazary J, Lazary A, Gonda X, Benko A, Molnar E, Juhasz G, Bagdy G. New evidence for the association of the serotonin transporter gene (SLC6A4) haplotypes, threatening life events, and depressive phenotype. Biol Psychiatry. 2008;64:498–504. [PubMed]

Articles from Human Heredity are provided here courtesy of **Karger Publishers**
