Our goal was to determine if examining genetic variants jointly could account for substantially more of the variance in phenotype than would be estimated by summing the results of univariate analyses. We identified several variants that jointly explain more phenotypic variance than they do individually, and three of the four tested signals replicate in independent data. Our results highlight two important lessons: (1) joint effects need to be investigated across all SNPs, not just those with main effects, and (2) it is not sufficient to look at the significance of an interaction term in a logistic regression model to determine if there is a joint effect.
The prime illustration for both of these points is the joint effect of the loci tagged by rs16969968 and rs3743075: as illustrated in , univariate analysis of the COGEND data indicates that the genetic locus tagged by rs16969968 displays a highly significant association to nicotine dependence, and the second locus, tagged by rs3743075, does not display evidence of association to nicotine dependence under univariate analyses. However, when the two genetic loci are analyzed jointly using the RPM, this pair accounts for much more of the variance in the dichotomous nicotine dependence/non-dependence phenotype than the sum of the univariate effects. This same pattern of results is seen in the ACS data: univariate analysis finds that the locus tagged by rs16969968 displays a substantial univariate effect on the dichotomous heavy versus light smoking phenotype, and the second locus is not associated with the phenotype in a univariate analysis. Again, a joint RPM analysis of the two SNPs accounts for considerably more of the phenotype variance than the sum of the univariate results. Although analysis by logistic regression also displayed the increase in explained variance (results not shown), the interaction term was not significant. As a result, we would have missed this joint effect had we required a significant interaction coefficient.
The case for these two loci having a synergistic joint effect has been greatly strengthened by results very recently reported by the Consortium for the Genetic Analysis of Smoking Phenotypes (CGASP) (Saccone et al. 2010a). The CGASP subjects consist of current and former smokers of European ancestry (N = 38,617), including COGEND and ACS subjects. A case/control phenotype of heavy versus light smoking was analyzed. In univariate analysis, the locus tagged by rs16969968 was highly associated with the heavy smoking phenotype (p = 5.96 × 10⁻³¹), while the second locus, tagged by rs588765, was considerably less significant (p = 4.54 × 10⁻⁴). When the two loci were analyzed jointly, each became more significant: p = 3.52 × 10⁻³⁶ for the locus tagged by rs16969968, and p = 6.03 × 10⁻⁹ (passing the threshold for genome-wide significance) for the locus tagged by rs588765. The logistic regression interaction term in this analysis was not significant. The explanation for these results is that the risk alleles for the two loci are negatively correlated. As a result, in a univariate analysis, the association of rs16969968 to nicotine dependence is present, but dampened, while the effect of the second locus is almost completely masked until jointly analyzed.
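The masking mechanism described above can be illustrated with a small simulation. This is only a sketch under simplifying assumptions that are not part of the study: the two predictors are standardized continuous scores rather than genotypes, the trait is a continuous liability fitted by ordinary least squares rather than a dichotomous phenotype analyzed by logistic regression or the RPM, and the effect sizes, the correlation, and the function name `simulate_masking` are hypothetical values chosen for illustration.

```python
import numpy as np

def simulate_masking(n=20000, rho=-0.5, beta1=0.5, beta2=0.3, seed=0):
    """Compare univariate and joint effect estimates for two negatively
    correlated predictors that both increase a trait (hypothetical values)."""
    rng = np.random.default_rng(seed)
    # Two standardized risk scores with correlation rho (negative here).
    cov = np.array([[1.0, rho], [rho, 1.0]])
    g = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=n)
    # Additive trait: no interaction term is simulated.
    y = beta1 * g[:, 0] + beta2 * g[:, 1] + rng.normal(size=n)
    # Univariate analysis: regress the trait on each predictor alone.
    univariate = np.array([np.polyfit(g[:, j], y, 1)[0] for j in range(2)])
    # Joint analysis: regress the trait on both predictors together.
    design = np.column_stack([np.ones(n), g])
    joint = np.linalg.lstsq(design, y, rcond=None)[0][1:]
    return univariate, joint

univariate, joint = simulate_masking()
print("univariate slopes:", univariate)
print("joint slopes:     ", joint)
```

Under this model the expected univariate slope for each predictor is its own effect plus the correlation times the other effect, so with these hypothetical values the first predictor is dampened (about 0.35 instead of 0.5) and the second is almost entirely masked (about 0.05 instead of 0.3), while the joint fit recovers both effects even though the simulated model contains no interaction.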
Although we cannot know how often this kind of joint effect might occur, we find this result, displayed in COGEND data, ACS data, and the larger CGASP data, a strong argument that tests for interactions not be limited to loci with substantial main effects, and that the consideration of joint effects should not be limited to ones giving rise to a statistically significant interaction term. It is an advantage of the RPM, focused on identifying sources of trait variance and not specifically interactions, that it identifies such pairs in an automated fashion, without the need to examine each model individually (an impractical approach when evaluating many thousands or millions of multi-locus models).
Although a replicable joint effect does not require a significant logistic regression interaction term, it appears that if the proportional increase in explained variance is great enough (e.g. over 80%), logistic regression can detect a significant interaction term. The first SNP pair in , rs16969968 and rs2133965, with nearly 90% increase in explained variance, was the only pair from this table for which the logistic regression interaction term was associated with a p value <0.05. In contrast, all five pairs from , which focused on large proportional increases in explained variance, had interaction p values <0.05.
The variants rs16969968, a non-synonymous coding SNP in CHRNA5, and rs1051730, a synonymous SNP, are of particular interest because these SNPs are strongly associated with nicotine dependence and smoking behaviors (Amos et al. 2008; Berrettini et al. 2008; Liu et al. 2010; Thorgeirsson et al. 2008; Thorgeirsson et al. 2010; Tobacco and Genetics Consortium 2010). These two SNPs are highly correlated (r² = 0.991) in our data, and we included rs16969968 in our list of 127 to be tested. We selected this SNP because it results in an amino acid change, and in vitro studies demonstrate that it alters receptor function (Bierut et al. 2008). Because SNPs were thinned to include only those with r² < 0.95, rs1051730 was not included in our primary analyses. We confirmed, with secondary univariate and joint analyses using rs1051730, that the results were very similar to those for rs16969968. (An analog of , replacing rs16969968 throughout with rs1051730, can be found in Supplemental Table S2.) We cannot say which, if either, of these two SNPs is biologically linked to smoking behavior. We can only say that they are representatives of a cluster of variants in tight linkage disequilibrium that display these effects. Further research in the lab will be required to identify the causative variants definitively.
We focused on cholinergic nicotinic receptor subunit genes because the protein products physically combine to form biologically active receptors that bind nicotine. For example, the α4 and α5 nicotinic acetylcholine receptor (nAChR) subunits combine with the nAChR β2 subunit to form an α4β2α5 receptor that is expressed in various brain regions including the mesolimbic reward pathway (McClure-Begley et al. 2009; Salminen et al. 2004; Zoli et al. 2002). Thus, it is plausible that variants in and around the genes that form these subunits may alter the subunit makeup of nicotinic receptors or the relative expression of the receptors. Of course, joint effects for these genes need not be limited to changes that alter an amino acid, but can include variants that alter splicing, mRNA expression, stability, or other regulatory factors. Our analytic evidence of joint effects can suggest models that can be tested biologically in the laboratory.
We have presented evidence that examining factors jointly, including SNPs displaying little to no univariate association to the phenotype, may uncover interactions and other synergistic effects that account for a sizable portion of the “missing” genetic variance remaining after univariate analyses of well-powered GWAS (Goldstein 2009; Hirschhorn 2009; Kraft and Hunter 2009). Although this study on the genetics of nicotine dependence focused on only a small portion of the genome, we found that examining nicotinic receptor variants in a pair-wise manner increased the proportion of variation explained. The challenge with such an approach, particularly if applied to data with 1,000,000 genotyped polymorphisms, is the number of tests. Even if limited to all pair-wise models, such analyses would be computationally expensive and statistically intractable. The problem becomes more extreme if analyses involve more than two factors at a time.
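To make the scale of the testing burden concrete, the short calculation below counts the pair-wise models for the 127 SNPs tested here and for a panel of 1,000,000 polymorphisms. The Bonferroni-style per-test threshold and the nominal alpha of 0.05 are assumptions used only for illustration, not the correction applied in this study, and `pairwise_burden` is a hypothetical helper name.

```python
from math import comb

def pairwise_burden(n_snps, alpha=0.05):
    """Return the number of unordered SNP pairs and an illustrative
    Bonferroni-style per-test threshold (an assumption, not the study's)."""
    n_pairs = comb(n_snps, 2)  # n * (n - 1) / 2 unordered pairs
    return n_pairs, alpha / n_pairs

for n_snps in (127, 1_000_000):
    n_pairs, threshold = pairwise_burden(n_snps)
    print(f"{n_snps:>9,} SNPs -> {n_pairs:,} pairs, per-test threshold {threshold:.2e}")
```

For 127 SNPs there are 8,001 pairs, whereas 1,000,000 SNPs give roughly 5 × 10¹¹ pairs, which would push an illustrative per-test threshold to about 10⁻¹³.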
Two approaches for addressing this problem have been applied in this study. One approach to address both the computational and statistical problem of the large number of tests is to filter the polymorphisms to be analyzed based on biological plausibility. In this case, we limited our analyses to polymorphisms located in or in linkage disequilibrium with nicotinic receptor subunit genes and filtered so that no two SNPs in our analysis had r² > 0.95. Other filtering approaches could make use of additional biological information such as evolutionary conservation, biological pathways, and results from previous studies.
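The correlation-based part of this filtering can be sketched as a greedy thinning pass over a genotype matrix. This is a minimal illustration rather than the procedure actually used in the study: the function name `prune_by_r2`, the 0/1/2 allele-count coding, the use of the squared Pearson correlation as r², and the rule that earlier columns take priority are all assumptions, and every SNP is assumed to be polymorphic so that the correlation is defined.

```python
import numpy as np

def prune_by_r2(genotypes, threshold=0.95):
    """Greedily keep SNPs so that no kept pair has r^2 above threshold.

    genotypes: array of shape (n_subjects, n_snps) holding 0/1/2 allele
    counts (hypothetical coding). Columns are visited in order, so SNPs
    to be prioritized (e.g. coding variants) should be placed first.
    Returns the list of kept column indices.
    """
    # Squared Pearson correlation between every pair of SNP columns.
    r2 = np.corrcoef(genotypes, rowvar=False) ** 2
    kept = []
    for j in range(genotypes.shape[1]):
        # Keep SNP j only if it is not too correlated with any kept SNP.
        if all(r2[j, k] <= threshold for k in kept):
            kept.append(j)
    return kept
```

Placing a functionally motivated SNP ahead of its proxies in the column order mirrors the choice described in this paper of retaining rs16969968 and dropping the highly correlated rs1051730, although how ties were resolved in general is not stated in the text.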
An approach to address the statistical problem of the large number of tests is to use a second, independent dataset. Although filtering may provide considerable help with the computational challenges, the number of tests will likely remain very large and the effects to be detected relatively modest. As a result, the issue of statistical significance will likely remain even after judicious a priori filtering of predictors. One way to address this is to use the initial data for hypothesis generation and an independent dataset to test a small number of hypotheses. If the key biological contributors to the phenotype are identified in the initial data, we would expect that the signal would replicate in an ethnically matched, independent replication sample.
The focus of this study on pair-wise analyses is one of its limitations. The reasons for this choice include limiting the large number of statistical tests performed, the theoretical results suggesting that factors not detectable with pair-wise analyses are unlikely to display large effects (Culverhouse et al. 2002), and the fact that our hypothesis-generating sample was of modest size for examining joint effects. We recognize that higher-order joint effects with three or more variants may play important roles in the etiology of complex disease, but larger datasets are needed to evaluate such complex multi-SNP effects. A second limitation of the study is that we were not able to test all of the identified signals in independent data.
This study was also limited to two analytic methods for examining joint genetic effects. A variety of methods are available to investigate how multiple genetic variants contribute jointly to a dichotomous trait. In addition to the RPM and the traditional logistic regression, there are other partition-based methods [e.g. the Combinatorial Partition Method (CPM) (Nelson et al. 2001), multifactor dimensionality reduction (MDR) (Hahn et al. 2003), the Generalized MDR (GMDR) (Lou et al. 2007)], as well as information-theory-based methods [e.g. k-way interaction information (KWII) and total correlation information (TCI) (Chanda et al. 2007)], and a variety of other approaches. Because these approaches each have their advantages for detecting joint effects, an analysis using multiple methods can identify additional signals that might have been missed by a single method, as well as provide increased confidence in signals that are identified by multiple methods.
The results of this study support three ideas that we believe are important for future studies. First, this approach identified several variants that have minimal evidence of association when tested individually, but which have larger effects when examined in combination with other genetic predictors. Second, requiring a significant interaction term in logistic regression may miss variants that jointly explain more of the variance. Finally, increases in explained variance can be obtained by including only a small number of interactions.
We note that as more researchers begin specifically looking for joint effects, the reports of interactions have increased. This is particularly true for genetic studies of smoking. In addition to the CGASP result mentioned above, several other studies have reported interesting results from the joint analysis of multiple loci on chr15q25 in samples of European descent (Li et al. 2010a; Liu et al. 2010; Thorgeirsson et al. 2010; Tobacco and Genetics Consortium 2010), African Americans (Li et al. 2010a), and Koreans (Li et al. 2010b). Additional joint genetic effects related to smoking have been reported in other candidate regions, including interactions among variants in GABBR1 and GABBR2 related to nicotine dependence in African and European Americans (Li et al. 2009) and an interaction between CYP2A6 and MAOA affecting smoking in Chinese (Tang et al. 2009).
As researchers struggle to make the best use of the wealth of genetic data now available, a key goal is to account for more of the phenotypic variance that is expected to be attributable to genetics. The results of this study suggest that pair-wise examination of genetic data can be a useful tool for achieving this goal, uncovering substantial additional genetic contribution to phenotypic variance through joint effects of multiple loci.