Bioinformatics. 2011 January 1; 27(1): 134–136.
Published online 2010 November 13. doi:  10.1093/bioinformatics/btq616
PMCID: PMC3008644

Automated construction and testing of multi-locus gene–gene associations


Summary: It has been argued that the missing heritability in common diseases may be due in part to rare variants and gene–gene effects. Haplotype analyses provide more power for rare variants, and joint analyses across genes can address multi-gene effects. Currently, methods are lacking to perform joint multi-locus association analyses across more than one gene/region. Here, we present a haplotype-mining gene–gene analysis method, which considers multi-locus data for two genes/regions simultaneously. This approach extends our single region haplotype-mining algorithm, hapConstructor, to two genes/regions. It allows construction of multi-locus SNP sets at both genes and tests joint gene–gene effects and interactions between single variants or haplotype combinations. A Monte Carlo framework is used to assess the statistical significance of the joint and interaction statistics; thus, the method can also be used with related individuals. This tool provides a flexible data-mining approach to identifying gene–gene effects that is otherwise currently unavailable.


Contact: ryan.abo@hsc.utah.edu


Haplotype and gene–gene analyses have been suggested as strategies to identify disease loci that single nucleotide polymorphism (SNP) approaches may have missed (Manolio et al., 2009). Haplotypes have the potential for improved characterization of variation across the locus set (Clark, 2004; Schaid, 2004). Yet, it is usually unclear which haplotypes to test and how to model them. Numerous methods consider all haplotypes spanning the entire locus set, with attempts to reduce the degrees of freedom that this approach otherwise confers (Liu et al., 2007; Tzeng and Zhang, 2007). Other techniques have been designed to analyze contiguous and non-contiguous locus subsets (Abo et al., 2008; Browning, 2006; Browning and Browning, 2007; Laramie et al., 2007; Lin, 2004).

It has been hypothesized (Moore, 2003), and in some cases shown (Combarros et al., 2009), that genetic factors at one gene can modify the effects of another gene on disease susceptibility. If such biological interaction exists, the association may only be evident when both genes are considered simultaneously. Gene–gene studies are complicated by the question of what constitutes a gene–gene interaction. For example, some approaches for testing interactions focus on association between two unlinked loci (Wu et al., 2008; Zhao et al., 2006), which does not provide any measure of departure from additivity, as a statistical interaction is classically defined.

Most often haplotype analyses are performed for a single region and gene–gene studies concentrate on single SNPs in each region. Methods that consider multi-locus data at more than one gene would be desirable to maximize the ability to detect association evidence. One such method exists to test specific haplotype interactions at unlinked regions (Becker et al., 2005). However, both haplotype and gene–gene analyses can result in high-dimensionality, and how to combine them is therefore a challenging problem.

To address these challenges, we have extended our single region haplotype-mining approach (Abo et al., 2008) to consider multi-locus data at two genes and test for association and interaction. We concentrate on a broad set of tests that considers both joint effects and interaction effects. In our gene–gene-mining process, data considered at each gene can be single or multi-locus. We anticipate that this gene–gene-mining approach will be most useful for hypothesis generation. However, if required, haplotype testing can also be performed using an empirical correction for multiple testing. Case–control and case-only designs are available, in addition to statistics to test joint and interaction effects. The method is implemented in a Monte Carlo (MC) testing framework and empirical construction-wide significance assessment is available for hypothesis testing.


For both genes/regions considered, maximum likelihood estimates (MLE) for all individuals' haplotype pairs and population haplotype frequencies are determined. All SNPs in each region and all individuals with sufficient data at both regions are considered (based on a user-defined genotype call rate threshold). Full-length MLE haplotypes, or sub-haplotypes extracted from them, are the genetic variables considered in the construction and testing process.

Consider h and k loci in unlinked genes, G1 = {M1,…, Mh} and G2 = {Mh+1,…, Mh+k}. The full locus set is S = G1 ∪ G2. First, all single locus association tests are conducted. These single locus associations are assessed against the first significance threshold, T1, which is user-defined. For any locus i with P-value ≤ T1, all locus pairs {Mi, Mj | ∀Mj ∈ S; j ≠ i} are considered at the second step. The locus pair {Mi, Mj} is the locus set, L, being considered. When the two loci in L span both genes, gene–gene tests between the loci are performed. When the loci in L are all within the same gene, the two loci are tested as a haplotype or composite genotype. Tests at step n are assessed at significance threshold Tn (∈ {T1,…, Th+k}), and the thresholds are usually chosen to increase in stringency with n. A locus set can be written as L = {g1 g2 | g1 ⊂ G1 and g2 ⊂ G2}, where g1 denotes loci that reside in G1 and g2 those that reside in G2. In steps n > 2, if there are multiple SNPs in both genes, gene–gene tests between haplotypes across g1 and haplotypes across g2 will be performed. The steps continue until no further locus sets pass the defined threshold values or the full locus sets have been tested.
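
The forward construction described above can be illustrated with a minimal Python sketch. It is not the hapConstructor implementation: the function names, the toy P-values and the caller-supplied `p_value` function are all hypothetical, and the sketch ignores the fixing of specific risk haplotypes between steps.

```python
def expand(passing_sets, all_loci):
    """Candidate sets for step n + 1: each passing set plus one locus not already in it."""
    return {frozenset(s) | {locus}
            for s in passing_sets
            for locus in all_loci
            if locus not in s}


def classify_locus_set(locus_set, gene1_loci):
    """'gene-gene' if the set spans both genes, otherwise 'within-gene'."""
    in_g1 = any(locus in gene1_loci for locus in locus_set)
    in_g2 = any(locus not in gene1_loci for locus in locus_set)
    return "gene-gene" if in_g1 and in_g2 else "within-gene"


def forward_search(all_loci, p_value, thresholds):
    """Forward steps only; p_value is a hypothetical caller-supplied association test."""
    current = {frozenset([locus]) for locus in all_loci}
    passed_by_step = []
    for threshold in thresholds:  # thresholds[n - 1] plays the role of T_n
        passed = sorted((s for s in current if p_value(s) <= threshold), key=sorted)
        passed_by_step.append(passed)
        current = expand(passed, all_loci)
        if not current:  # nothing passed, or the full locus set was reached
            break
    return passed_by_step


# Toy example: G1 = {M1, M2}, G2 = {M3, M4}; the P-values are made up.
toy_p = {frozenset(["M1"]): 0.01, frozenset(["M1", "M3"]): 0.001}
steps = forward_search(["M1", "M2", "M3", "M4"],
                       lambda s: toy_p.get(s, 1.0),
                       thresholds=[0.05, 0.01, 0.001])
```

In this toy run, M1 passes T1, the pair {M1, M3} passes T2 and would be handled as a gene–gene test because it spans both genes, and no three-locus set passes T3.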

To avoid a strict hill-climbing algorithm, which is susceptible to becoming trapped at local optima, we have incorporated a backward step. At each backward step, the algorithm considers subsets of size n − 1 from the current locus set that were not previously tested. Any subsets that pass the significance threshold, Tn, are retained and the process continues forward again.
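
As a rough illustration of the backward step, the following Python sketch enumerates the untested subsets of size n − 1. The function name and the representation of locus sets as frozensets are assumptions made for the example, not part of the published software.

```python
from itertools import combinations


def backward_candidates(locus_set, already_tested):
    """Subsets of size n - 1 of the current locus set that have not been tested yet."""
    size = len(locus_set) - 1
    if size < 1:  # a single locus has no non-empty smaller subset
        return []
    return [frozenset(subset)
            for subset in combinations(sorted(locus_set), size)
            if frozenset(subset) not in already_tested]


# Toy example: {M1, M3} was reached on the way forward, so only the
# other two pairs remain to be tested in the backward step.
tested = {frozenset(["M1", "M3"])}
candidates = backward_candidates(frozenset(["M1", "M3", "M4"]), tested)
```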

For locus sets where g1 and/or g2 are multi-locus, haplotypes or composite genotypes are considered. The algorithm considers each haplotype across gi as a potential ‘risk haplotype’ and compares it with all other haplotypes grouped together. For any specific haplotype, this reduces the multi-locus data to a biallelic system, which can be used with standard allelic, dominant, recessive and additive models for testing both within and across genes. For composite genotype combinations, phase is unimportant: each locus in L is modeled separately as dominant or recessive, and the combinations of these models are considered across loci. Hence, composite genotype tests can be performed within or across genes.
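
The biallelic recoding can be sketched in a few lines of Python. The representation of a haplotype as an allele string and the function names are assumptions for illustration only.

```python
def risk_copies(haplotype_pair, risk_haplotype):
    """Number of copies (0, 1 or 2) of the risk haplotype carried by one individual."""
    return sum(h == risk_haplotype for h in haplotype_pair)


def recode(haplotype_pair, risk_haplotype, model):
    """Reduce a haplotype pair to a single value under the chosen genetic model."""
    copies = risk_copies(haplotype_pair, risk_haplotype)
    if model == "additive":
        return copies
    if model == "dominant":   # at least one copy of the risk haplotype
        return int(copies >= 1)
    if model == "recessive":  # two copies of the risk haplotype
        return int(copies == 2)
    raise ValueError(f"unknown model: {model}")


# Toy example: every haplotype other than 'ACG' is pooled into one
# 'non-risk' class, so the pair below carries exactly one risk copy.
pair = ("ACG", "ATG")
coded = {m: recode(pair, "ACG", m) for m in ("additive", "dominant", "recessive")}
```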

To reduce the tests performed, at step n + 1 the algorithm only expands the specific risk haplotypes that passed the significance threshold (i.e. the alleles at loci from step n are fixed). A similar rule is applied to the composite genotypes.

Single locus, haplotype and composite genotype models are tested using odds ratios, chi-square and chi-square trend association statistics. For locus sets containing loci in two genes, L = {g1 g2 | g1 ⊂ G1 and g2 ⊂ G2}, an interaction odds ratio test and a correlation-based statistic are offered to identify gene–gene effects between the two locus sets, g1 and g2. As described above, multi-locus sets within genes are considered using biallelic recoding. We refer to specific haplotypes across g1 and g2 as h1 and h2.

The interaction odds ratio between h1 and h2 is calculated using the method described by Thomas (2004), IORm,n, where m and n denote dominant or recessive models imposed on h1 and h2, respectively, and 0 indicates the wildtype.

\[ \mathrm{IOR}_{m,n} = \frac{\mathrm{OR}_{m,n}}{\mathrm{OR}_{m,0} \times \mathrm{OR}_{0,n}} \]

Under the null hypothesis, H0: IORm,n = 1, the odds of disease given h1 and h2 is the product of the odds of disease for each hi.
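
A small numerical sketch may help here. It assumes that the interaction odds ratio takes the usual multiplicative form, that is, the joint odds ratio divided by the product of the two single-factor odds ratios, each measured against the double-wildtype group; this is consistent with the null hypothesis stated above. The function names and the counts are invented for illustration.

```python
def odds_ratio(cases_exposed, controls_exposed, cases_ref, controls_ref):
    """Odds ratio of an exposed group relative to a reference group."""
    return (cases_exposed * controls_ref) / (controls_exposed * cases_ref)


def interaction_odds_ratio(cases, controls):
    """IOR from counts keyed by (h1 status, h2 status), where 1 = risk and 0 = wildtype.

    Assumes the multiplicative form OR(1,1) / (OR(1,0) * OR(0,1)).
    """
    ref = (cases[(0, 0)], controls[(0, 0)])
    or_both = odds_ratio(cases[(1, 1)], controls[(1, 1)], *ref)
    or_h1 = odds_ratio(cases[(1, 0)], controls[(1, 0)], *ref)
    or_h2 = odds_ratio(cases[(0, 1)], controls[(0, 1)], *ref)
    return or_both / (or_h1 * or_h2)


# Made-up counts: each haplotype alone doubles the odds of disease,
# but carrying both multiplies the odds by 8 rather than by 2 * 2 = 4.
cases = {(0, 0): 100, (1, 0): 60, (0, 1): 40, (1, 1): 48}
controls = {(0, 0): 200, (1, 0): 60, (0, 1): 40, (1, 1): 12}
ior = interaction_odds_ratio(cases, controls)
```

With these counts the joint odds ratio is 8 and each single-factor odds ratio is 2, so the sketch gives an IOR of 2, a departure from the multiplicative null.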

We have also implemented interaction tests based on correlation (Wu et al., 2008; Zhao et al., 2006). The correlation between specific haplotypes, h1 and h2, from locus sets g1 and g2 is calculated. Following Wang et al. (2007), the correlation is determined as follows, where each individual i is assigned a value xij for locus set gj based on its MLE haplotype pairs:

equation image

The correlation between h1 and h2 is estimated by the correlation coefficient:

\[ r = \frac{\sum_{i=1}^{N} (x_{i1} - \bar{x}_1)(x_{i2} - \bar{x}_2)}{N s_1 s_2} \]

where $\bar{x}_j = \frac{1}{N}\sum_{i=1}^{N} x_{ij}$ and $s_j^2 = \frac{1}{N}\sum_{i=1}^{N} (x_{ij} - \bar{x}_j)^2$, j = (1, 2), and N is the number of individuals.

This correlation coefficient is an estimate of the composite correlation statistic (Zaykin et al., 2008), which is robust to Hardy–Weinberg disequilibrium. For a case–control study design, the method tests H0: rcase − rcontrol = 0. For a case-only design, the method tests H0: rcase = 0, and the first step in the automated process considers the correlation between pairs of single SNPs. Meta-statistics are also available for analyzing multiple datasets.
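
The correlation-based statistics can be sketched as follows. The sketch assumes a standard Pearson correlation coefficient and, purely for illustration, codes each individual by the number of copies (0, 1 or 2) of the haplotype of interest; the coding actually used for xij and all function names here are assumptions.

```python
from math import sqrt


def pearson_r(x1, x2):
    """Pearson correlation between two equal-length lists of per-individual codes.

    Raises ZeroDivisionError if either list has zero variance.
    """
    n = len(x1)
    mean1, mean2 = sum(x1) / n, sum(x2) / n
    cov = sum((a - mean1) * (b - mean2) for a, b in zip(x1, x2))
    var1 = sum((a - mean1) ** 2 for a in x1)
    var2 = sum((b - mean2) ** 2 for b in x2)
    return cov / sqrt(var1 * var2)


def case_control_statistic(cases, controls):
    """Difference r_case - r_control; each argument is a pair (x1, x2) of code lists."""
    return pearson_r(*cases) - pearson_r(*controls)


# Toy data: copies of h1 and h2 carried by five cases and five controls.
case_codes = ([0, 1, 2, 1, 2], [0, 1, 2, 1, 2])      # perfectly correlated
control_codes = ([0, 1, 2, 1, 0], [1, 0, 1, 2, 1])   # essentially unrelated
statistic = case_control_statistic(case_codes, control_codes)
```

For a case-only design the analogous quantity would simply be `pearson_r` computed in cases. In either design the observed value would be compared with a null distribution rather than interpreted on its own.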

Statistical significance is determined with an MC procedure. The validity of the MC procedure depends on properly matching the null simulations to the observed data with regard to pedigree structure, missing data structure and phasing procedure (Curtis and Sham, 2006). Our MC procedure is based on a two-region multi-locus gene-drop. In both regions, haplotype pairs are assigned to founders and independent individuals based on the estimated full-length haplotype frequencies. Full-length haplotypes for both regions are then assigned to pedigree descendants using gene-dropping techniques based on Mendelian inheritance (MacCluer et al., 1986). The missing data structure is then imposed on the simulated multi-locus genotype data and the known phase is ignored. These simulated data are then statistically phased, to match the procedure performed with the observed data. The procedure generates null genotype configurations from which null statistics are calculated and a null empirical distribution is created. It must be noted that this MC procedure assumes a null of no linkage and no association. If strong linkage exists (but no association), there is the potential for inflated type 1 errors, although in simulations we find that, for reasonable linkage models, the MC procedure remains a good approximation for the null and type 1 error rates remain valid.
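
The core of a gene-drop and of the empirical significance calculation can be sketched in Python. This is a simplified illustration under stated assumptions: it handles one region and one parent–child transmission, it omits the imposition of missing data and the re-phasing step, and the function names and the plain-proportion P-value are choices made for the example rather than details taken from the software.

```python
import random


def draw_founder(haplotype_freqs, rng):
    """Assign a haplotype pair to a founder from estimated haplotype frequencies."""
    haplotypes = list(haplotype_freqs)
    weights = [haplotype_freqs[h] for h in haplotypes]
    return tuple(rng.choices(haplotypes, weights=weights, k=2))


def drop_child(mother, father, rng):
    """Mendelian transmission: one full-length haplotype from each parent, chosen at random."""
    return (rng.choice(mother), rng.choice(father))


def empirical_p(observed, null_stats):
    """Proportion of null statistics at least as large as the observed statistic."""
    return sum(s >= observed for s in null_stats) / len(null_stats)


# Toy example for one region: two founders and one child.
rng = random.Random(2010)  # seeded only so that the example is repeatable
freqs = {"ACG": 0.6, "ATG": 0.3, "GTG": 0.1}
mother = draw_founder(freqs, rng)
father = draw_founder(freqs, rng)
child = drop_child(mother, father, rng)
```

For the two-region procedure, the same draw and drop would be carried out independently in each region, which is what encodes the null of no linkage and no association between them.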

Correction for the data-mining process is also available and, if selected, will provide construction-wide significance and false discovery rates. Correction for construction is implemented in the same way as for hapConstructor (Abo et al., 2008), where the null distribution for a complete construction run is generated by conducting the same search process starting from 1000 null configurations.
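
One simple way to picture a construction-wide correction is sketched below. It assumes that each null construction run is summarized by its smallest P-value and that an observed result is compared against those minima; this summary and the function name are assumptions for illustration and may differ from the hapConstructor implementation.

```python
def construction_wide_p(observed_p, null_run_min_ps):
    """Fraction of null construction runs whose best P-value is at least as small."""
    hits = sum(p <= observed_p for p in null_run_min_ps)
    return hits / len(null_run_min_ps)


# Toy example with four null runs (a real analysis would use many more).
null_minima = [0.0005, 0.002, 0.01, 0.0001]
adjusted = construction_wide_p(0.001, null_minima)
```

Here two of the four null runs produced a result at least as extreme as the observed P-value of 0.001, so the construction-wide value is 0.5 even though the nominal P-value is small.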


Our method is implemented as a Java-based program. It is an extension of the hapConstructor module (Abo et al., 2008) in the Genie software (Allen-Brady et al., 2006). The program can be run on Windows, Unix or Linux machines with Java 1.6 and at least 2 GB of RAM. An example dataset consisting of 14 SNPs in one gene and 11 SNPs in the second gene required 7 h and 11 min with 4 GB of memory to complete building to step 3. Parameter options for this example included default critical thresholds, 10 000 null simulations and no construction-wide assessment. It is important to note that this example may not provide useful insight to other implementations of the method because there are many factors that will affect the running time of the program. These include: number of SNPs, number of samples, number of null simulations selected for significance assessment, critical thresholds selected for the steps in the building process, use of the multiple-testing correction procedure and whether or not there is an association signal. Program details, including the example described above, are available at

Funding: R.A. is an NLM fellow (grant T15 LM0724); National Institutes of Health (CA 098364); the Susan G. Komen Foundation and the Avon Foundation Breast Cancer Fund (to N.J.C.).

Conflict of Interest: none declared.


  • Abo R, et al. hapConstructor: automatic construction and testing of haplotypes in a Monte Carlo framework. Bioinformatics. 2008;24:2105–2107.
  • Allen-Brady K, et al. PedGenie: an analysis approach for genetic association testing in extended pedigrees and genealogies of arbitrary size. BMC Bioinformatics. 2006;7:209.
  • Becker T, et al. Haplotype interaction analysis of unlinked regions. Genet. Epidemiol. 2005;29:313–322.
  • Browning BL, Browning SR. Efficient multilocus association testing for whole genome association studies using localized haplotype clustering. Genet. Epidemiol. 2007;31:365–375.
  • Browning SR. Multilocus association mapping using variable-length Markov chains. Am. J. Hum. Genet. 2006;78:903–913.
  • Clark AG. The role of haplotypes in candidate gene studies. Genet. Epidemiol. 2004;27:321–333.
  • Combarros O, et al. Epistasis in sporadic Alzheimer's disease. Neurobiol. Aging. 2009;30:1333–1349.
  • Curtis D, Sham PC. Estimated haplotype counts from case-control samples cannot be treated as observed counts. Am. J. Hum. Genet. 2006;78:729–730; author reply 728–729.
  • Laramie JM, et al. HaploBuild: an algorithm to construct non-contiguous associated haplotypes in family based genetic studies. Bioinformatics. 2007;23:2190–2192.
  • Lin S. Exhaustive allelic transmission disequilibrium tests as a new approach to genome-wide association studies. Nat. Genet. 2004;36:1181–1188.
  • Liu J, et al. Incorporating single-locus tests into haplotype cladistic analysis in case-control studies. PLoS Genet. 2007;3:e46.
  • MacCluer JW, et al. Pedigree analysis by computer simulation. Zoo Biol. 1986;5:147–160.
  • Manolio TA, et al. Finding the missing heritability of complex diseases. Nature. 2009;461:747–753.
  • Moore JH. The ubiquitous nature of epistasis in determining susceptibility to common human diseases. Hum. Hered. 2003;56:73–82.
  • Schaid DJ. Evaluating associations of haplotypes with traits. Genet. Epidemiol. 2004;27:348–364.
  • Thomas DC. Statistical Methods in Genetic Epidemiology. New York, USA: Oxford University Press; 2004.
  • Tzeng J, Zhang D. Haplotype-based association analysis via variance-components score test. Am. J. Hum. Genet. 2007;81:927–938.
  • Wang T, et al. Improving power in contrasting linkage-disequilibrium patterns between cases and controls. Am. J. Hum. Genet. 2007;80:911–920.
  • Wu X, et al. Composite measure of linkage disequilibrium for testing interaction between unlinked loci. Eur. J. Hum. Genet. 2008;16:644–651.
  • Zaykin DV, et al. Correlation-based inference for linkage disequilibrium with multiple alleles. Genetics. 2008;180:533–545.
  • Zhao J, et al. Test for interaction between two unlinked loci. Am. J. Hum. Genet. 2006;79:831–845.
