PMCCPMCCPMCC

Search tips
Search criteria 

Advanced

 
Logo of bioinfoLink to Publisher's site
 
Bioinformatics. 2013 May 15; 29(10): 1241–1249.
Published online 2013 April 18. doi:  10.1093/bioinformatics/btt139
PMCID: PMC3711507

Network-guided sparse regression modeling for detection of gene-by-gene interactions

Abstract

Motivation: Genetic variants identified by genome-wide association studies to date explain only a small fraction of total heritability. Gene-by-gene interaction is one important potential source of unexplained total heritability. We propose a novel approach to detect such interactions that uses penalized regression and sparse estimation principles, and incorporates outside biological knowledge through a network-based penalty.

Results: We tested our new method on simulated and real data. Simulation showed that with reasonable outside biological knowledge, our method performs noticeably better than stage-wise strategies (i.e. selecting main effects first, and interactions second, from those main effects selected) in finding true interactions, especially when the marginal strength of main effects is weak. We applied our method to Framingham Heart Study data on total plasma immunoglobulin E (IgE) concentrations and found a number of interactions among different classes of human leukocyte antigen genes that may interact to influence the risk of developing IgE dysregulation and allergy.

Availability: The proposed method is implemented in R and available at http://math.bu.edu/people/kolaczyk/software.html.

Contact: ude.ub@ulnehc

Supplementary information: Supplementary data are available at Bioinformatics online.

1 INTRODUCTION

Unlike Mendelian diseases, in which disease phenotypes are largely driven by mutation in a single gene locus, complex disease and traits are associated with a number of factors, both genetic and environmental, as well as lifestyle. In addition, while most Mendelian diseases are rare, many complex diseases are frightfully common, from asthma to heart disease, hypertension to Alzheimer’s and Parkinson’s to various forms of cancer.

Arguably motivated by classical successes with Mendelian diseases and traits, the study of complex diseases and traits in the modern genomics era has focused largely on the identification of individually important genes. Genome-wide association studies (GWAS), the current state of the art, have been central to the discovery of many genes in various diseases (e.g. Hindorff et al., 2010). However, unfortunately, the vast majority of genetic variants associated with complex traits identified to date explain only a small amount of the overall variance of the trait in the underlying population (Manolio et al., 2009). As a result, most GWAS findings thus far have had little clinical impact.

Currently, most GWAS are carried out one single nucleotide polymorphism (SNP) at a time. Typically, for each SNP, a model is specified, relating disease status or disease trait to the SNP plus other potentially relevant covariates. The statistical significance of each SNP is quantified through the P-value of an appropriate test. Finally, a multiple-testing correction is applied to correct the collection of P-values across SNPs. The end result is a list of SNPs declared to be significantly associated with the status or trait of interest, which in turn can be mapped to their closest genes, although some associations have been found in ‘gene deserts’ (Hindorff et al., 2010). The single-SNP approach has the important attribute that it is (relatively) computationally efficient. However, it can be severely under-powered because of the small effect size of most genetic variants identified to date (Hindorff et al., 2010; Manolio et al., 2009). Additionally, this approach does not adjust for correlation among SNPs, nor does it extend in a natural manner to search for interactions between markers. In contrast, multiple regression (i.e. where multiple SNPs are modeled simultaneously) is a natural alternative. However, naive implementation (i.e. incorporating all SNPs of interest) is both infeasible and undesirable. This is due to various reasons, including the sheer number of SNPs typically available (e.g. hundreds of thousands to millions), the comparatively small number of SNPs likely to be associated and ‘small n, large p’ problems.

Recently, however, computationally efficient multiple regression strategies for GWAS have begun to emerge that use various methods of high-dimensional variable selection (e.g. Ayers and Cordell, 2010; Logsdon et al., 2010; Ma et al., 2010; Szymczak et al., 2009; Wu et al., 2009, 2010; Zhou et al. 2010). Compared with traditional single-SNP methods, penalized regression methods have been found to yield fewer correlated SNPs (Ayers and Cordell, 2010) and to be capable of producing substantially more power while having a lower false-discovery rate (FDR; He and Lin, 2011). Furthermore, and most relevant to the current article, regression methods can include SNP by SNP interactions in a natural manner. However, to date, this typically has been done in a greedy stage-wise manner, by fitting main-effect models first and then restricting attention to interactions among those effects found significant (Wu et al., 2009, 2010). In addition, the above work makes limited or no use of supplementary biological information on, for example, biological pathways and gene function.

We propose a novel network-guided statistical methodology to facilitate the discovery of gene-by-gene (G × G) interactions associated with complex quantitative traits related to human disease, one which addresses both of the shortcomings cited above. Main effects and interaction effects in our model are chosen simultaneously, thus allowing for the possibility of detecting genes for which the marginal main effect is weak. Variable selection is done through penalized regression using sparse estimation principles. The penalty allows for the incorporation of information on biological pathways and gene function into the analysis of continuous traits related to human disease. In doing so, this penalty acts as an informal prior distribution on the set of possible G × G interactions, which in practice allows the investigator to reduce the number of interactions examined for the model from the nominal and computationally prohibitive O((number of SNPs)2) to a more manageable, say, O(number of SNPs).

Simulations indicate that, given relevant pathway information, our approach performs well in finding true interactions without losing the ability of detecting main effects, and can noticeably outperform existing stage-wise methods. In addition, application of our proposed methodology to a study of plasma total immunoglobulin E (IgE) concentrations for participants in the Framingham Heart Study (FHS) illustrates the substantial promise of the method.

The rest of this article is organized as follows: in Section 2, we describe our statistical approach. We introduce the model and our proposed penalty, describe how biological information is incorporated into the penalty and explain the optimization algorithm used for model fitting and a strategy for choosing tuning parameters. The design and results of an extensive simulation study are presented in Section 3, in which we examine models with varying degrees of interactions and penalties reflecting different extents of biological knowledge. Our analysis of the IgE concentration data is provided in Section 4. Some additional discussion may be found in Section 5.

2 METHODS

2.1 Modeling G × G interaction

Let Y be a quantitative trait of interest, and let An external file that holds a picture, illustration, etc.
Object name is btt139i1.jpg be p predictors representing SNPs. To include interactions, we are interested in a model of the form

equation image
(1)

where An external file that holds a picture, illustration, etc.
Object name is btt139i2.jpg. We expect that both the An external file that holds a picture, illustration, etc.
Object name is btt139i3.jpgs and the An external file that holds a picture, illustration, etc.
Object name is btt139i4.jpgs are sparse, as it is unlikely that there is more than a small fraction of SNPs affecting the phenotype Y, either as main effects or as interactors.

In practice, p will range from hundreds to millions. Our goal is to fit the high-dimensional model (1) to data. When p is large but only a small percentage of predictors and interactions are present in the true model, a general approach is to minimize a penalized regression criterion. Accordingly, we propose to estimate the coefficients An external file that holds a picture, illustration, etc.
Object name is btt139i5.jpg in our model using a penalized least-squares criterion. Let An external file that holds a picture, illustration, etc.
Object name is btt139i6.jpg and An external file that holds a picture, illustration, etc.
Object name is btt139i7.jpg An external file that holds a picture, illustration, etc.
Object name is btt139i7a.jpg represent our variables An external file that holds a picture, illustration, etc.
Object name is btt139i8.jpg and An external file that holds a picture, illustration, etc.
Object name is btt139i9.jpg collected over n samples. Our criterion is then written as follows:

equation image
(2)

Penalized linear regression has been found to be a powerful tool for fitting high-dimensional models, particularly in situations where the nominal number of variables is large relative to the number of observations (e.g. Bühlmann and Van De Geer, 2011). In the context of GWAS, typically An external file that holds a picture, illustration, etc.
Object name is btt139i10.jpg. Hence, it is impossible to fit a model with the full set of An external file that holds a picture, illustration, etc.
Object name is btt139i11.jpg nominal interactions among all p SNPs. However, the coefficient vector An external file that holds a picture, illustration, etc.
Object name is btt139i12.jpg is expected to be sparse. Therefore, a penalty function that enforces sparseness can be helpful here, by encouraging the optimization in (2) to find solutions in which a large percentage of the main effects and their interactions are zero, thus dropping the corresponding terms from the model.

Following standard practice, we wish to include interactions only if their corresponding main effects are also included in the model. The construction of the sparseness penalty An external file that holds a picture, illustration, etc.
Object name is btt139i13.jpg, therefore, must be handled with some care, so as to enforce the resulting hierarchical constraint among coefficients. In addition, we would like our penalty to allow for the use of biological knowledge (e.g. biological pathways, gene functional classes, etc.) in fitting the model. We address these two goals by defining a penalty of the form

equation image
(3)

where the An external file that holds a picture, illustration, etc.
Object name is btt139i14.jpg are non-negative weights provided by the investigator, and An external file that holds a picture, illustration, etc.
Object name is btt139i15.jpg is used to denote the matrix of weights over all SNP pairs An external file that holds a picture, illustration, etc.
Object name is btt139i16.jpg. The values An external file that holds a picture, illustration, etc.
Object name is btt139i17.jpg are tuning parameters.

Our penalty is a generalization of that proposed by Radchenko and James (2010) for the purpose of fitting general types of interaction models (in Radchenko and James (2010), An external file that holds a picture, illustration, etc.
Object name is btt139i18.jpg for all An external file that holds a picture, illustration, etc.
Object name is btt139i19.jpg). Note that, following those authors, we express the penalty in un-normalized form (standard lasso algorithms, for example, without interactions, assume An external file that holds a picture, illustration, etc.
Object name is btt139i20.jpg and hence An external file that holds a picture, illustration, etc.
Object name is btt139i21.jpg). It can be shown that the penalty automatically enforces the hierarchical constraint (i.e. inclusion of main effects before interactions). Main effects and interactions can be treated differently by varying An external file that holds a picture, illustration, etc.
Object name is btt139i22.jpg with respect to An external file that holds a picture, illustration, etc.
Object name is btt139i23.jpg. The elements of the matrix W are generic and allow for the possibility of including biological information a priori into the model selection process. We next describe a manner for doing so, in which network principles are used in a natural way.

2.2 A network-based penalty

Here we describe how we construct the matrix W, using information on biological pathways. Similar constructions may be had generally using other common resources (e.g., databases of genes and their biological function, such as Gene Ontology). Note that W acts as a dissimilarity matrix in An external file that holds a picture, illustration, etc.
Object name is btt139i24.jpg. Under our construction, W is defined in association with a graph showing relationships among SNPs, which in turn derives from a bipartite graph relating SNPs to pathways. The intuition underlying our construction is to (i) allow interactions only among SNPs corresponding to genes that are common to at least one pathway, and (ii) to encourage interactions more among those SNP pairs that are common to more pathways.

Let An external file that holds a picture, illustration, etc.
Object name is btt139i25.jpg denote our p SNPs, and An external file that holds a picture, illustration, etc.
Object name is btt139i26.jpg, our m pathways. We define G to be a bipartite graph, with one set of nodes representing SNPs, and the other, pathways. An edge in G connects an SNP An external file that holds a picture, illustration, etc.
Object name is btt139i27.jpg to a pathway An external file that holds a picture, illustration, etc.
Object name is btt139i28.jpg if that SNP maps sufficiently close to a gene found in the pathway. We then define An external file that holds a picture, illustration, etc.
Object name is btt139i29.jpg to be the one-mode projection of G onto the set of SNPs. Figures 1 and and22 show three toy examples of graphs G and An external file that holds a picture, illustration, etc.
Object name is btt139i30.jpg, for An external file that holds a picture, illustration, etc.
Object name is btt139i31.jpg SNPs and An external file that holds a picture, illustration, etc.
Object name is btt139i32.jpg pathways.

Fig. 1.
Simple illustration of network representations between SNPs (S1, S2, S3) and pathways (P1, P2)
Fig. 2.
One-mode projection of the three examples in Figure 1

An equivalent representation of the relationship between SNPs and pathways in the network An external file that holds a picture, illustration, etc.
Object name is btt139i33.jpg is a An external file that holds a picture, illustration, etc.
Object name is btt139i34.jpg incidence matrix M, describing which SNPs are linked to which pathways. For the three examples in Figure 1, the corresponding incidence matrices are

equation image
(4)

Similarly, the analogous An external file that holds a picture, illustration, etc.
Object name is btt139i35.jpg (weighted) adjacency matrix is the standard representation of the one-mode projection An external file that holds a picture, illustration, etc.
Object name is btt139i36.jpg. Calling this matrix A, it is related to the incidence matrix M of the original graph G through the expression An external file that holds a picture, illustration, etc.
Object name is btt139i37.jpg. For the three examples shown in Figure 1 and Figure 2, the adjacency matrices are

equation image
(5)

Finally, we define the dissimilarity matrix W element-wise by setting An external file that holds a picture, illustration, etc.
Object name is btt139i38.jpg. In the case where An external file that holds a picture, illustration, etc.
Object name is btt139i39.jpg, we set An external file that holds a picture, illustration, etc.
Object name is btt139i40.jpg by convention. Note that the resulting implication for the optimization in (2) is that An external file that holds a picture, illustration, etc.
Object name is btt139i41.jpg is set to zero, i.e. the term An external file that holds a picture, illustration, etc.
Object name is btt139i42.jpg cannot enter the model. Hence, only those pairs of SNPs An external file that holds a picture, illustration, etc.
Object name is btt139i43.jpg that share at least one pathway (i.e. An external file that holds a picture, illustration, etc.
Object name is btt139i44.jpg) may potentially enter the model. As a result, it is possible to substantially reduce the number of interaction terms considered for entry into the model, thus making the simultaneous search for main effects and interactions easier to perform. For example, in the application presented in Section 4, 17 025 SNPs were used, nominally corresponding to ~145 million interactions. However, in using the 186 pathways from the Kyoto Encyclopeida of Genes and Genomes (KEGG) database to construct our matrix W, this number was reduced to less than 480 000 potential interactions.

We note that there are certainly other ways of constructing the matrix W. For example, a variation on the procedure described above would be to define An external file that holds a picture, illustration, etc.
Object name is btt139i45.jpg if An external file that holds a picture, illustration, etc.
Object name is btt139i46.jpg, and infinity otherwise. This is equivalent to equipping the graph An external file that holds a picture, illustration, etc.
Object name is btt139i47.jpg with a binary adjacency matrix and letting An external file that holds a picture, illustration, etc.
Object name is btt139i48.jpg as before, and results in the equal treatment of all interactions that are allowed to enter the model, regardless of how many pathways are shared by pairs An external file that holds a picture, illustration, etc.
Object name is btt139i49.jpg. In addition, of course, other types of outside information—if judged relevant—can be used in place of pathways, as mentioned above.

2.3 Model selection and fitting

To perform the optimization in (2), we use cyclic coordinate descent, a now-standard choice for problems such as ours (e.g. Friedman et al., 2007; Wu and Lange, 2008; Wu et al., 2009). As the name indicates, the cyclic coordinate descent algorithm updates one element of An external file that holds a picture, illustration, etc.
Object name is btt139i50.jpg at a time using coordinate descent principles, while holding all others fixed, and cycles through all elements until convergence. In our context, the details of the resulting algorithm parallel those of Radchenko and James (2010). We, therefore, present only a sketch of the algorithm and relevant formulas here. Detailed derivation can be found in Supplementary Material, Section 1.

Consider the estimation of An external file that holds a picture, illustration, etc.
Object name is btt139i51.jpg. We note that, with respect to this parameter, the objective function in (2) can be written as

equation image
(6)

where An external file that holds a picture, illustration, etc.
Object name is btt139i52.jpg. Here An external file that holds a picture, illustration, etc.
Object name is btt139i53.jpg is the current value of An external file that holds a picture, illustration, etc.
Object name is btt139i54.jpg at this stage of our iterative algorithm, and similarly for An external file that holds a picture, illustration, etc.
Object name is btt139i55.jpg, while An external file that holds a picture, illustration, etc.
Object name is btt139i56.jpg is all of the rest of the penalty term An external file that holds a picture, illustration, etc.
Object name is btt139i57.jpg that does not involve An external file that holds a picture, illustration, etc.
Object name is btt139i58.jpg.

The updates to the estimates An external file that holds a picture, illustration, etc.
Object name is btt139i59.jpg of the main effects An external file that holds a picture, illustration, etc.
Object name is btt139i60.jpg take the form of a shrinkage estimate, An external file that holds a picture, illustration, etc.
Object name is btt139i61.jpg, for An external file that holds a picture, illustration, etc.
Object name is btt139i62.jpg. Here An external file that holds a picture, illustration, etc.
Object name is btt139i63.jpg is the solution to the problem of fitting a regression-through-the-origin for An external file that holds a picture, illustration, etc.
Object name is btt139i64.jpg on An external file that holds a picture, illustration, etc.
Object name is btt139i65.jpg, and the shrinkage parameter An external file that holds a picture, illustration, etc.
Object name is btt139i66.jpg is the solution to the equation

equation image
(7)

where An external file that holds a picture, illustration, etc.
Object name is btt139i67.jpg. The value An external file that holds a picture, illustration, etc.
Object name is btt139i68.jpg can be obtained using the Newton–Raphson method. In the special case where An external file that holds a picture, illustration, etc.
Object name is btt139i69.jpg, which must be the case when An external file that holds a picture, illustration, etc.
Object name is btt139i70.jpg for all An external file that holds a picture, illustration, etc.
Object name is btt139i71.jpg (i.e. SNP j is not allowed to participate in any interactions), Equation (7) can be solved in closed form, yielding An external file that holds a picture, illustration, etc.
Object name is btt139i72.jpg.

Now consider the estimation of An external file that holds a picture, illustration, etc.
Object name is btt139i73.jpg. Similar arguments show that the iterations in the cyclic coordinate descent algorithm involve updates of the form An external file that holds a picture, illustration, etc.
Object name is btt139i74.jpg, for An external file that holds a picture, illustration, etc.
Object name is btt139i75.jpg. Here An external file that holds a picture, illustration, etc.
Object name is btt139i76.jpg is the solution to the problem of fitting a regression-through-the-origin for An external file that holds a picture, illustration, etc.
Object name is btt139i77.jpg on An external file that holds a picture, illustration, etc.
Object name is btt139i78.jpg, where

equation image

The shrinkage parameter An external file that holds a picture, illustration, etc.
Object name is btt139i79.jpg for interaction terms is the solution to the equation

equation image
(8)

where

equation image

and

equation image

which again can be computed using the Newton–Raphson method. When An external file that holds a picture, illustration, etc.
Object name is btt139i80.jpg and An external file that holds a picture, illustration, etc.
Object name is btt139i81.jpg are both zero, An external file that holds a picture, illustration, etc.
Object name is btt139i82.jpg can be solved in closed form, yielding

equation image

The shrunken estimates of coefficients of predictors and interactions are updated in the iterative process described above until convergence is achieved. Following standard practice, on termination of our cyclic coordinate descent algorithm, we generate a final estimate of coefficients for those variables An external file that holds a picture, illustration, etc.
Object name is btt139i83.jpg and An external file that holds a picture, illustration, etc.
Object name is btt139i84.jpg that were allowed to enter the model, using ordinary least squares. All corresponding effect-size estimates and P-values produced by our methodology result from this final step.

For datasets with a small number of predictors An external file that holds a picture, illustration, etc.
Object name is btt139i85.jpg, the algorithm can be easily fit as described. However, for larger numbers of predictors, we use a ‘swindle’, in analogy to that proposed by Wu et al. (2009) and implemented in MENDEL (Lange et al., 2001). The basic idea is to apply the algorithm to a much smaller number, say k, of pre-screened predictors, and to choose the smoothing parameter(s) such that only a desired number, say An external file that holds a picture, illustration, etc.
Object name is btt139i86.jpg, of predictors An external file that holds a picture, illustration, etc.
Object name is btt139i87.jpg enters the model. The Karush–Kuhn–Tucker (KKT) conditions for our optimization problem are then checked for the estimate An external file that holds a picture, illustration, etc.
Object name is btt139i88.jpg resulting from our algorithm (augmented with zeros for coefficients of all predictors eliminated at the pre-screening stage). If the KKT conditions are satisfied, we are done; if not, we double k and repeat the process. Following Wu et al. (2009), we let our initial choice of k be a multiple of s, i.e. An external file that holds a picture, illustration, etc.
Object name is btt139i89.jpg in the applications we show. Pre-screening consists of sorting the t statistics of fitting ordinary least-square regression of Y on each predictor An external file that holds a picture, illustration, etc.
Object name is btt139i90.jpg separately (i.e. traditional GWAS) and extracting those predictors with the k largest t statistics. Details can be found in Supplementary Material, Section 2.

2.4 Choice of tuning parameters

In the penalty function An external file that holds a picture, illustration, etc.
Object name is btt139i91.jpg defined in (3), the tuning parameters An external file that holds a picture, illustration, etc.
Object name is btt139i92.jpg directly influence the number of variables that enter the final model. In principle, these two parameters may be allowed to vary freely and a cross-validation strategy used to select the best values. However, this strategy is unrealistic for GWAS, where the number of SNPs may range into millions. Instead, we use a strategy that allows investigators some control in dictating how many variables enter the model, and thereby specify the tuning parameters implicitly.

First, we impose a linear relation between the two tuning parameters, i.e. An external file that holds a picture, illustration, etc.
Object name is btt139i93.jpg. Because An external file that holds a picture, illustration, etc.
Object name is btt139i94.jpg is directly involved only in the selection of interaction terms, specifying the constant c may be interpreted as ‘tuning’ the number of interactions relative to main effects. The tuning parameter An external file that holds a picture, illustration, etc.
Object name is btt139i95.jpg is responsible for the number of main effects in the model. Because An external file that holds a picture, illustration, etc.
Object name is btt139i96.jpg is essentially a decreasing function of the number of main effects entered in the model and often investigators have at least some rough expectation of how many SNPs they feel are likely to be associated with their phenotype, we set An external file that holds a picture, illustration, etc.
Object name is btt139i97.jpg by pre-specifying the number of main effects to include in the final model (i.e. denoted s above).

Second, calculations show that the relation An external file that holds a picture, illustration, etc.
Object name is btt139i98.jpg holds, where An external file that holds a picture, illustration, etc.
Object name is btt139i99.jpg is the variance of SNP j coded as the number of minor alleles under the assumption of Hardy–Weinberg equilibrium; the variance is defined here in terms of the minor allele frequency (MAF) An external file that holds a picture, illustration, etc.
Object name is btt139i100.jpg, and r is the ratio of the thresholds for main effects and interactions to enter the model within the cyclic coordinate descent algorithm. See Supplementary Material, Section 3, for details. We recommend that c be chosen by the user through (i) specifying a desired ratio r and (ii) knowledge of the distribution of SNP MAFs.

By setting the desired number of main effects and the value c, we implicitly specify the values of the tuning parameters An external file that holds a picture, illustration, etc.
Object name is btt139i101.jpg. A smaller value of c (corresponding to a larger value of r) means more interactions may enter the model, for a fixed number of main effects.

3 SIMULATION

3.1 Simulation study design

We carried out a simulation study to assess (i) the performance of our method under various interaction scenarios and (ii) the effect of different choices of the W matrix in our penalty on our ability to detect interaction. We also compared our method with the stage-wise selection method proposed by Wu et al. (2009), which restricts interaction search to SNPs first declared to have main effects. In each simulated dataset, there are 1000 subjects and 1000 SNPs as predictors. The SNPs are coded additively (0, 1, 2), simulated with a MAF of 50%, and drawn from a binomial distribution with two trials. Lower MAFs were also investigated (MAF ≥ 10%, see Supplementary Material, Section 4.5). The quantitative trait Y is then simulated using the effect SNPs and interactions specified under assumed models. The effect sizes are set for 80% power under standard single-SNP additive models. Among the 1000 SNPs, 20 (SNPs 1–20) have true main effects on the simulated trait, and the remaining 980 SNPs have no effect.

To test our method in various interaction situations, we evaluate three different models:

  • Model 1: only 20 main effects with no interaction,
  • Model 2: 20 main effects + all two way interactions among SNPs 1–5, and
  • Model 3: 20 main effects + SNP1 × SNP2 + SNP3 × SNP4 + SNP5 × SNP6 + … + SNP19 × SNP20.

Model 1 has no interactions involved. Models 2 and 3 both have 10 interaction terms involved, and the interactions are all among true main effects. However, in Model 2, there is one cluster with 5 interacting SNPs, while in Model 3 there are 10 clusters, each with two interacting SNPs.

In addition, we explore six different ways to construct the W matrix used in the penalty. In each case, we allow all SNPs to be evaluated as possible main effects, by having all ones down the diagonal of W. For the possible interaction terms, coded by the off-diagonal elements of W, we consider the following additions:

  • An external file that holds a picture, illustration, etc.
Object name is btt139i102.jpg: + true interactions in models,
  • An external file that holds a picture, illustration, etc.
Object name is btt139i103.jpg: + two-way interactions among all true main effects (SNPs 1–20),
  • An external file that holds a picture, illustration, etc.
Object name is btt139i104.jpg: + true interactions + random ‘noise’ interactions,
  • An external file that holds a picture, illustration, etc.
Object name is btt139i105.jpg: + two-way interactions among all true main effects + random ‘noise’ interactions,
  • An external file that holds a picture, illustration, etc.
Object name is btt139i106.jpg: + two-way interactions among SNPs 1–40 (all true main effects and 20 non-active SNPs), and
  • An external file that holds a picture, illustration, etc.
Object name is btt139i107.jpg: + two-way interactions among SNPs 1–10 and 21–30 + two-way interactions among SNPs 11–20 and 31–40.

The matrix An external file that holds a picture, illustration, etc.
Object name is btt139i108.jpg is an ideal case. It only allows true interactions built in the model to enter that model. Note that An external file that holds a picture, illustration, etc.
Object name is btt139i109.jpg is different for each of Models 1, 2 and 3. The matrix An external file that holds a picture, illustration, etc.
Object name is btt139i110.jpg introduces some ‘noise’ interactions by allowing all interactions among true main effects. It is equivalent to a single pathway of SNPs 1–20 and is the same for all models. The matrix An external file that holds a picture, illustration, etc.
Object name is btt139i111.jpg adds random ‘noise’ interactions to An external file that holds a picture, illustration, etc.
Object name is btt139i112.jpg, while An external file that holds a picture, illustration, etc.
Object name is btt139i113.jpg adds random ‘noise’ interactions to An external file that holds a picture, illustration, etc.
Object name is btt139i114.jpg. Note that An external file that holds a picture, illustration, etc.
Object name is btt139i115.jpg and An external file that holds a picture, illustration, etc.
Object name is btt139i116.jpg both vary across models. The random ‘noise’ interactions are introduced in a manner aimed at mimicking the interaction structure corresponding to the KEGG database, only some subset of which will likely be relevant to any given study (and the rest, ‘noise’). Specifically, an additional set of ‘pathways’ (i.e. gene sets) were defined, in addition to those defined by the models themselves, until 20 pathways were formed. To these 20, we then randomly allocated 160 additional SNPs so that the average number of SNPs per pathway roughly mimicked what is observed in KEGG. An external file that holds a picture, illustration, etc.
Object name is btt139i117.jpg represents a single pathway of SNPs 1–40, similar to An external file that holds a picture, illustration, etc.
Object name is btt139i118.jpg but with more SNPs (20 non-active SNPs) involved. An external file that holds a picture, illustration, etc.
Object name is btt139i119.jpg then represents two pathways with each having 10 active and 10 non-active SNPs. It is similar to An external file that holds a picture, illustration, etc.
Object name is btt139i120.jpg in the sense that the allowed interactions involve SNPs 1–40, but An external file that holds a picture, illustration, etc.
Object name is btt139i121.jpg has smaller amount of non-active interactions.

We chose An external file that holds a picture, illustration, etc.
Object name is btt139i122.jpg by setting the desired number of main effects selected as 25, the value of An external file that holds a picture, illustration, etc.
Object name is btt139i123.jpg is automatically determined by our program once the value 25 is provided. This is a natural choice because there are 1000 SNPs in our data and 20 true main effects in the models. This choice will affect type I error because at least 5 of the 25 predictors selected as main effects will be false, but this number is modest compared with the total of 1000 SNPs and can be easily adjusted by resetting An external file that holds a picture, illustration, etc.
Object name is btt139i124.jpg according to investigator preference. The parameter c is set to 0.5 (i.e. An external file that holds a picture, illustration, etc.
Object name is btt139i125.jpg under our model). The selected predictors are then ranked by their absolute t-values resulting from the ordinary least-square fit on the selected predictors for the final model. By setting a threshold on the rank, we choose the number of interactions to be reported and compare the performance of interaction selection under various W matrix specifications across a range of thresholds.

3.2 Simulation results

In Figure 3, we compare the results under various W matrix specifications, for Models 2 and 3. We assess the ability to find true interactions by computing the average FDR of interactions over 100 trials and plotting 1 − FDR against the rank threshold for selected interactions. As the threshold increases, more interactions get selected, and thus FDR increases and the curves have a downward trend. Examining the results, we see that An external file that holds a picture, illustration, etc.
Object name is btt139i126.jpg clearly has the best performance, as it reflects the truth about the interactions in the model; all false interactions are excluded a priori, and thus the 1 − FDR curve for An external file that holds a picture, illustration, etc.
Object name is btt139i127.jpg is a straight line at 1. Recall that An external file that holds a picture, illustration, etc.
Object name is btt139i128.jpg is equivalent to An external file that holds a picture, illustration, etc.
Object name is btt139i129.jpg plus random ‘noise’. Importantly, therefore, we note that pure ‘noise’ among non-active SNPs does not appear to impact much the selection of true interactions, as An external file that holds a picture, illustration, etc.
Object name is btt139i130.jpg has the second best performance after An external file that holds a picture, illustration, etc.
Object name is btt139i131.jpg. This conclusion is reinforced by the results for An external file that holds a picture, illustration, etc.
Object name is btt139i132.jpg and An external file that holds a picture, illustration, etc.
Object name is btt139i133.jpg, where the 1 − FDR curves are nearly identical. In contrast, the results in Figure 3 also suggest that selection of interactions is to some extent adversely affected when allowing ‘noise’ interactions among active SNPs, as An external file that holds a picture, illustration, etc.
Object name is btt139i134.jpg has better performance than An external file that holds a picture, illustration, etc.
Object name is btt139i135.jpg and An external file that holds a picture, illustration, etc.
Object name is btt139i136.jpg, while An external file that holds a picture, illustration, etc.
Object name is btt139i137.jpg and An external file that holds a picture, illustration, etc.
Object name is btt139i138.jpg have similar performance.

Fig. 3.
Interaction results of Model 2 and Model 3 under 6 W matrix specifications and Mendel analysis

In comparing our method with that of Wu et al. (2009), as implemented in Mendel, we can see in Figure 3 that our method outperforms stage-wise selection for all choices considered for the matrix W. This observation is significant in showing that using accurate prior information, even with moderate ‘noise’ (i.e. specifying non-existent interactions), it is possible to outperform the stage-wise approach by over 10–20% on the 1 − FDR scale. Note that we used the default option in Mendel that tests interactions among selected main effects. There are other options in Mendel one can choose that may perform somewhat better.

With respect to the detection of main effects, the performance of our methodology is shown in Table 1. The average power of main effects are grouped into three categories: the true SNPs involved in interaction, true SNPs not involved in interaction and the SNPs that have no effect on the simulated trait. Recall that there is no interaction in Model 1, and all true SNPs in Model 3 are involved in interaction, so they have only two relevant groups of SNPs. As we can see from the Model 2 result, SNPs involved in interactions are detected more easily than SNPs not involved in interactions. Comparing Table 1 with Table 2, we can also see that our method has the same or higher average power for detecting true main effects than the stage-wise approach of Wu et al. (2009), as implemented in Mendel. In both approaches, the non-active SNPs have a small chance of being declared as main effects.

Table 1.
Simulation results for detection of main effects
Table 2.
Detection of main effects by stage-wise competitor

We also tested two more cases where main-effect sizes were moderate and weak, corresponding to 50% power and 20% power, respectively, under standard single-SNP additive models. To assess the performance of our method in finding interactions under these various strengths of main effects, we reverse the direction of interactions so that there are no marginal SNP effects. The W matrix we used is An external file that holds a picture, illustration, etc.
Object name is btt139i145.jpg, as described before, to make a fair comparison with respect to the inclusion of noise interactions. The results under such models are shown in Figure 4. As we can see from the figure, the approach implemented using Mendel could not find true interactions under any of the models [the regular (Mendel), the moderate (Mendel.moderate) and the weak (Mendel.weak) main-effect models], as it only searches for interactions among main effects selected in the first stage. In contrast, our proposed approach is able to find some of the true interactions because it incorporates information from the W matrix, the network of interactions built from outside knowledge. Not surprisingly, the model with stronger main effect (M2W2, M3W2) performs better in finding true interactions than moderate (M2W2.moderate, M3W2.moderate) or weak (M2W2.weak, M3W2.weak) main-effect models.

Fig. 4.
Interaction results of Model 2 and Model 3 without marginal main effect

There are a number of other important questions that one can explore in simulations. We further checked four of them and found that (i) our approach outperforms simple association tests; (ii) scaling up data size by adding more ‘noise’ SNPs makes it harder to find true main effects but does not adversely affect the selection of interactions’ (iii) besides the obvious advantage of decreasing computing time, using network information in our penalty yields advantages in detecting interactions beyond that deriving from the hierarchical nature of the penalty; and (iv) our approach is robust to modest linkage disequilibrium among SNPs. A detailed description of these results can be found in Supplementary Material, Section 4.

4 APPLICATION TO IGE CONCENTRATION

We applied our algorithm to evaluate G × G interactions for log plasma IgE concentration, a biomarker that is often elevated in individuals with allergy to environmental allergens. An elevated plasma IgE concentration is associated with allergic diseases, including asthma, allergic rhinoconjunctivitis, atopic dermatitis and food allergy. Although several genes influencing IgE concentrations have been identified to date, the interactions among these genes or others yet to be identified to be important players have not been studied (Granada et al., 2012).

We sought to investigate G × G effects on log IgE concentration in the FHS cohorts. Participants from the town of Framingham, Massachusetts, have been recruited in these studies starting in 1948, and have been followed over the years for the development of heart disease and related traits, including pulmonary function and allergic response measured by IgE concentration. Our analyses include 6975 participants, 441 from the original cohort recruited in 1948, an additional 2848 from the Offspring cohort recruited in 1971 and finally 3686 participants from the Third Generation cohort initiated in 2002. A recent GWAS on Framingham participants identified new genetic loci associated with plasma total IgE concentrations (Granada et al., 2012). We are interested in looking at G × G interactions associated with IgE concentration, as an illustration of our methodology.

4.1 Preliminaries

Genotypes were from Affymetrix 500K and MIPS 50K arrays, with imputation performed using HapMap 2 European reference panel (Li et al., 2010). Dosage genotypes (expected number of minor alleles) were used in our analysis, although the software implementation of the Wu et al. (2009) approach (Mendel) required genotypes to be coded as 0, 1 or 2 and could not handle dosage. Therefore, in our analysis using Mendel, for each individual, we used the genotype with the highest posterior probability at each SNP. We analyze the natural logarithm of plasma total IgE concentrations as our phenotype (i.e. Y), adjusted for smoking status (current, former and amount of life time smoking in terms of pack-years), age, sex and cohort of origin. A total of 6975 participants (3209 men and 3766 women) aged 19 years and older had good-quality genotypes and were included in our analysis. Familial relationship was ignored when applying our algorithm and the Wu et al. (2009) approach, but we subsequently applied linear mixed-effect models to account for familial correlation to obtain estimates of effect sizes.

Some pre-processing was used to select a set of SNPs to include in our analysis. First, we attempted to map each of the 2 411 590 genotyped and imputed SNPs in the dataset to a reference gene containing it. If no such gene was available, then we mapped the SNP to the closest reference gene within 60 kb of the SNP, if available. Otherwise, the SNP was excluded. After establishing this mapping between genes and SNPs, some genes were found to include multiple SNPs. We kept only one SNP for each gene, selecting in each case that SNP most significantly associated with the phenotype, based on a linear mixed-effect regression. As a result, the SNPs in the final dataset have low linkage disequilibrium (correlation) and a unique SNP-to-gene correspondence (additional analyses suggest that our results, reported below, are fairly robust to modest amounts of disequilibrium in these data. See Supplementary Materials, Section 4.4).

The final dataset has 17 025 SNPs/genes. We used the KEGG pathway database to build our W matrix, following the steps described in Section 2. The KEGG pathway database has a total of 72 354 genes and 5268 unique genes, resulting in 479 066 interactions allowed in our W matrix.

4.2 Results

For our analysis on 17 025 SNPs, we chose to look for 10 main effects, although we allowed the algorithm to terminate after selecting 10 ± 1 main effects, resulting in 9 main effects selected in the current analysis. The parameter c was set to 0.1, which, based on an average estimated SNP variance of 0.27 for these data, corresponds to An external file that holds a picture, illustration, etc.
Object name is btt139i146.jpg. Nine interactions were found in our approach, yielding a model with An external file that holds a picture, illustration, etc.
Object name is btt139i147.jpg variables. To calibrate our results with those from the stage-wise procedure of Wu et al. (2009), as implemented in Mendel, the latter was run to select 9 variables in the first stage (i.e. fitting only main effects), and then 15 variables in the second stage (i.e. fitting both main effects and interactions, selected from among the 9 SNPs resulting from the first stage). This process produced a final model with nine main effects and six interactions. In terms of computing time, our analysis ran in ~5 min on our cluster Linga, equipped with two Intel Xeon CPUs E5345 @ 2.33 GHz, with four cores each, and 16 GB/32 GB of RAM for each node (the job was submitted to one node and used one core), while the analysis in Mendel ran in ~2 min. Given that our method evaluates An external file that holds a picture, illustration, etc.
Object name is btt139i148.jpg times more potential interactions than Mendel, the observed trade-off between computing time and number of possible interactions being evaluated appears to be reasonable.

The results from our proposed method and from the stage-wise procedure are shown in the left and right, respectively, of Table 3. The estimates of effect size and the ranks are from the linear mixed-effect model for the final model after variable selection procedure, for both methods. Genes previously found in a GWAS of these FHS data (Granada et al., 2012) are indicated with an asterisk in the table. In our approach, four of the six interaction pairs involved human leukocyte antigen (HLA) genes, which encode antigen-presenting cell-surface proteins that are key regulators of the immune response. The other two interactions identified were among genes both previously associated with log IgE concentrations (Granada et al., 2012). In contrast, Mendel did not detect any interactions among genes in the HLA regions or among pairs of previously associated genes.

Table 3.
Results of application to IgE concentration data

From a biological perspective, a number of the interactions discovered by our method are of non-trivial potential interest. The major histocompatibility complex (MHC) class I antigens HLA-A, -B and -C are involved with cell-mediated immunity targeting cells expressing proteins produced intracellularly, for example, by viruses, while the MHC class II antigens HLA-DP, -DQ and -DR play key roles with humoral immunity, including the production of IgE antibodies directed against environmental allergens (Klein and Sato, 2000). HLA-G is a non-classical MHC class I antigen that may have immunomodulatory effects through actions on natural killer cells, T lymphocytes and antigen-presenting cells (Carosella et al., 2008). Genetic variants in these different classes of HLA genes—each class influencing a different but interconnected aspect of immune function—could well interact to influence the risk of developing IgE dysregulation and allergy. The observed interaction between SNPs in the alpha chain of the high affinity receptor for IgE (FCER1A) and interleukin (IL)-13 genes may reflect a number of mechanisms. For example, a genetic variant causing increased expression of FcAn external file that holds a picture, illustration, etc.
Object name is btt139i149.jpgRIAn external file that holds a picture, illustration, etc.
Object name is btt139i150.jpg on mast cells would lead to increased antigen-induced activation of these cells, which would consequently produce more IL-13 (Burd et al., 1995), leading to more class-switch recombination and IgE production. Genetic variation of FcAn external file that holds a picture, illustration, etc.
Object name is btt139i151.jpgRIAn external file that holds a picture, illustration, etc.
Object name is btt139i152.jpg on classical antigen-presenting cells may also promote Th2 cell activation (Potaczek et al., 2009) with consequent IL-13 release. Thus, SNPs in these two genes in the same pathway leading to increased IgE production could have synergistic effects. Overall, identification of these interactions may help identify the children at highest risk for developing allergy, possibly helping focus interventions to prevent allergy, and may provide new insights into the genetic basis and mechanisms of allergy.

5 DISCUSSION

There are many potential sources of missing hereditability. G × G interactions is one such source. In turn, there are many types of genetic interactions, including multiplicative and non-multiplicative (Mukherjee et al. 2008, 2012). In this article, we focus on investigating multiplicative interactions in the form of a product between two variables. Our proposed methodology provides a promising new approach to tapping this source, by exploiting the wealth of biological knowledge accumulated in various pathway databases.

The simulations reported here suggest that our approach performs better in finding true interactions with a reasonable prior biological knowledge incorporated, compared with the stage-wise regression method that first fits a main-effect model and then searches for interactions among selected main effects. Furthermore, the real-data results are promising in suggesting that better performance likely may be realized in real data as well.

Future work to be done on this topic includes extending the computational algorithm to account for linkage disequilibrium among SNPs, and producing a software implementation that uses standard formatted files such as genotype files from the PLINK (Purcell et al. 2007) software package.

Funding: National Institute Health (grants DK078616, ES020827 GM078987, AG028321, N01 HC25195 and P01 AI050516) (in part). A portion of this research used the Linux Cluster for Genetic Analysis (LinGA-II), funded by the Robert Dawson Evans Endowment of the Department of Medicine at Boston University School of Medicine and Boston Medical Center.

Conflict of Interest: none declared.

REFERENCES

  • Ayers K, Cordell H. SNP selection in genome-wide and candidate gene studies via penalized logistic regression. Genet. Epidemiol. 2010;34:879–891. [PMC free article] [PubMed]
  • Bühlmann P, Van De Geer S. Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer-Verlag New York Inc; 2011.
  • Burd P, et al. Activated mast cells produce interleukin 13. J. Exp. Med. 1995;181:1373–1380. [PMC free article] [PubMed]
  • Carosella E, et al. Hla-g: from biology to clinical benefits. Trends Immunol. 2008;29:125–132. [PubMed]
  • Friedman J, et al. Pathwise coordinate optimization. Ann. Appl. Stat. 2007;1:302–332.
  • Granada M, et al. A genome-wide association study of plasma total ige concentrations in the framingham heart study. J. Allergy Clin. Immunol. 2012;129:840–845. [PMC free article] [PubMed]
  • He Q, Lin D. A variable selection method for genome-wide association studies. Bioinformatics. 2011;27:1. [PMC free article] [PubMed]
  • Hindorff L, et al. A catalog of published genome-wide association studies. National Human Genome Research Institute. 2010 http://www. genome. gov/gwastudies.
  • Klein J, Sato A. The HLA system. First of two parts. N Eng. J. Med. 2000;343:702–709. [PubMed]
  • Lange K, et al. Mendel version 4.0: a complete package for the exact genetic analysis of discrete traits in pedigree and population data sets. Am. J. Hum. Genet. 2001;69:504. [PubMed]
  • Li Y, et al. Mach: using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genet. Epidemiol. 2010;34:816–834. [PMC free article] [PubMed]
  • Logsdon B, et al. A variational Bayes algorithm for fast and accurate multiple locus genome-wide association analysis. BMC Bioinformatics. 2010;11:58. [PMC free article] [PubMed]
  • Ma S, et al. Identification of non-Hodgkin’s lymphoma prognosis signatures using the CTGDR method. Bioinformatics. 2010;26:15. [PMC free article] [PubMed]
  • Manolio T, et al. Finding the missing heritability of complex diseases. Nature. 2009;461:747–753. [PMC free article] [PubMed]
  • Mukherjee B, et al. Tests for gene-environment interaction from case-control data: a novel study of type i error, power and designs. Genet. Epidemiol. 2008;32:615–626. [PubMed]
  • Mukherjee B, et al. Testing gene-environment interaction in large-scale case-control association studies: possible choices and comparisons. Am. J. Epidemiol. 2012;175:177–190. [PMC free article] [PubMed]
  • Potaczek D, et al. Genetic variability of the high-affinity ige receptor α-subunit (fcεriα) Immunol. Res. 2009;45:75–84. [PubMed]
  • Purcell S, et al. Plink: a toolset for whole-genome association and population-based linkage analysis. Am. J. Hum. Genet. 2007;81:559–575. (Software PLINK v1.07) http://pngu.mgh.harvard.edu/purcell/plink/ [PubMed]
  • Radchenko P, James G. Variable selection using adaptive nonlinear interaction structures in high dimensions. J. Am. Stat. Assoc. 2010;105:1541–1553.
  • Szymczak S, et al. Machine learning in genome-wide association studies. Genet. Epidemiol. 2009;33(Suppl. 1):S51–S57. [PubMed]
  • Wu J, et al. Screen and clean: a tool for identifying interactions in genome-wide association studies. Genet. Epidemiol. 2010;34:275–285. [PMC free article] [PubMed]
  • Wu T, Lange K. Coordinate descent algorithms for lasso penalized regression. Annals. 2008;2:224–244.
  • Wu T, et al. Genome-wide association analysis by lasso penalized logistic regression. Bioinformatics. 2009;25:714. [PMC free article] [PubMed]
  • Zhou H, et al. Association screening of common and rare genetic variants by penalized regression. Bioinformatics. 2010;26:2375. [PMC free article] [PubMed]

Articles from Bioinformatics are provided here courtesy of Oxford University Press