Home | About | Journals | Submit | Contact Us | Français |

**|**Biostatistics**|**PMC2742497

Formats

Article sections

- Abstract
- 1. INTRODUCTION
- 2. A SIMPLE MOTIVATING EXAMPLE
- 3. METHODS
- 4. SIMULATION STUDIES
- 5. CASE STUDY: BREAST CANCER DATA
- 6. DISCUSSION
- FUNDING
- SUPPLEMENTARY MATERIAL
- References

Authors

Related links

Biostatistics. 2009 October; 10(4): 706–718.

Published online 2009 July 22. doi: 10.1093/biostatistics/kxp025

PMCID: PMC2742497

Department of Biostatistics and Medical Informatics, University of Wisconsin, Madison, 1300 University Avenue, Madison, WI 53706-1510, USA ; Email: ude.csiw.tatsoib@roizdnek

Received 2008 December 8; Revised 2009 April 27; Accepted 2009 May 26.

Copyright © The Author 2009. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oxfordjournals.org.

This article has been cited by other articles in PMC.

Network reconstruction is a main goal of many biological endeavors. Graphical Gaussian models (GGMs) are often used since the underlying assumptions are well understood, the graph is readily estimated by calculating the partial correlation (paCor) matrix, and its interpretation is straightforward. In spite of these advantages, GGMs are limited in that interactions are not accommodated as the underlying multivariate normality assumption allows for linear dependencies only. As we show, when applied in the presence of interactions, the GGM framework can lead to incorrect inference regarding dependence. Identifying the exact dependence structure in this context is a difficult problem, largely because an analogue of the paCor matrix is not available and dependencies can involve many nodes. We here present a computationally efficient approach to identify bivariate interactions in networks. A key element is recognizing that interactions have a marginal linear effect and as a result information about their presence can be obtained from the paCor matrix. Theoretical derivations for the exact effect are presented and used to motivate the approach; and simulations suggest that the method works well, even in fairly complicated scenarios. Practical advantages are demonstrated in analyses of data from a breast cancer study.

The reconstruction of biological networks is a key component of efforts to better understand, diagnose, and treat disease. With advances in high-throughput technologies, the data necessary for reconstruction are now available and so too are a variety of statistical reconstruction methods (for a review and comparison of several methods, see Hartemink, 2005; Werhli *and others*, 2006; Markowetz and Spang, 2007). Gene association networks (GANs) are undirected graphs often used as an efficient way to summarize dependencies among genes, while being less computationally intensive than directed networks. In a GAN, most often nodes represent gene expression measurements and edges quantify the association between 2 genes measured using the Pearson correlation or the mutual information in the case of relevance networks (RNs) (Butte and Kohane, 2003), or the partial correlation (paCor) in the case of graphical Gaussian models (GGMs) (Schäfer and Strimmer, 2005a, 2005b). While learning RNs from data is straightforward, their main disadvantage is that associations are not inferred in the context of the entire gene set under investigation. As a result, a strong association between 2 genes is not necessarily due to a direct dependence; often times it is due to associations with other genes in the system under study.

The GGM framework allows one to distinguish between direct and indirect associations. In short, nodes are assumed to follow a multivariate normal (MVN) distribution and an edge represents conditional dependence between 2 nodes, given all other nodes (Lauritzen, 1996; Whitakker, 1990). The presence of edges corresponds exactly to nonzero paCors, and in this way estimation of the graph is equivalent to estimation of the paCor matrix. This simple, yet effective, framework has proven useful in a number of genomic studies (Toh and Horimoto, 2002; Wille *and others*, 2004; Matsuno *and others*, 2006; Ma *and others*, 2007; Keller *and others*, 2008).

In spite of the clear advantages, GGMs are limited in that they do not account for multiplicative dependence among nodes. Each variable is assumed normal, with conditional expectation a *linear* combination of some subset of the other variables. More precisely, a necessary and sufficient condition for a random vector (*X*_{1},*X*_{2},…,*X*_{p}) to follow an MVN distribution with mean *μ* = (*μ*_{1},…,*μ*_{p}) and covariance matrix Σ is that for all *k* = 1,…,*p*, the conditional distribution of *X*_{k} given the remaining *p* − 1 variables, *X*_{ − k}, depends on *X*_{ − k} only through the conditional mean *E*[*X*_{k}|*X*_{ − k}] = *α*_{k} + *B*_{k}*X*_{ − k}, where *B*_{k} = *Cov*(*X*_{k},*X*_{ − k})· *Cov*(*X*_{ − k})^{ − 1} and *α*_{k} = *μ*_{k} − *B*_{k}*μ*_{ − k} (Fisk, 1970). As a result, if *X*_{k} depends on *X*_{i}·*X*_{j} (for *i*,*j*≠*k*), for example, a GGM is not sufficient. This limitation is an important one since interactions are common in biological studies of individual complex traits (Shiozawa *and others*, 2000; Wittenburg *and others*, 2001; Carlborg *and others*, 2003; Hanlon *and others*, 2006), studies of groups of metabolites (Wentzell *and others*, 2007), and high-throughput studies of gene expression (Kirst *and others*, 2005; Brem and Kruglyak 2005; West *and others*, 2007).

In practice, one often proceeds with GGM estimation and inference in the presence of interactions. As we demonstrate in Section 2, doing so can lead to incorrect inference regarding dependence. Inferring the correct dependence structure is challenging when interactions are present, since an analogue of the paCor matrix that completely determines the dependence structure is not available, and dependencies can involve many nodes. The difficulty is exacerbated for studies with a large number of nodes (*p*) and relatively small sample size (*n*).

Cox and Wermuth (1994) recognized this problem. In an investigation of the possible reasons for departures from MVN, they focused on the presence of pairwise interactions, noting concern “that the reduction of the observations to covariance matrices overlooks important features of the dependencies of intrinsic interest.” They proposed an approach to identify the interactions that evaluates all possible linear regression models among the variables considered allowing for single pairwise interaction terms. Interactions with large t-statistics are identified as important and considered for further empirical investigation. Although useful when few nodes are available, the approach cannot be easily scaled to cases with large *p*, and is not possible when *n*<*p*.

As in Cox and Wermuth (1994), we too consider the special case of non MVN where the full conditional distributions are Gaussian, but the joint is not, due to pairwise interactions. Within this context, we propose a computationally efficient approach that uses GGM estimation techniques to identify variables affected by interactions and thereby identify regions in a resulting graph where edges may not imply direct or linear dependencies. Our approach is presented in Section 3, with much of the theoretical support relegated to supplementary material available at *Biostatistics* online (http://www.biostatistics.oxfordjournals.org). The method relies on estimation of multiple paCor matrices, and as a result is only slightly more computationally demanding than traditional GGM reconstruction. Simulation studies in Section 4 show that the approach identifies interactors and main effects with high sensitivity for moderate sample sizes, including cases where *n*<*p*, while controlling the false discovery rate (FDR). Further evaluation is given in Section 5, where we identify novel interactors using data from a breast cancer study.

Let **X** = (*X*_{1},…,*X*_{p}) denote *p* random variables. The paCor of *X*_{i} and *X*_{j} is defined as the correlation of the residuals obtained after linearly regressing each of the 2 on the remaining *p* − 2 variables. In the case of GGMs, where **X** is MVN, the paCor matrix completely specifies the dependence structure among the variables: the paCor between 2 nodes is 0 if and only if the 2 nodes are conditionally independent given all other variables. If **X** is not MVN, this interpretation of the paCor matrix does not hold.

An interesting, well-studied, and practically relevant case of non-MVN concerns the situation in which all full conditional distributions are normal, but the joint is not MVN, due to pairwise interactions. Consider 2 simple examples where *X*_{1}, *X*_{2}, and *ϵ* are independent normals (more general examples are provided in supplementary material available at Biostatistics online (http://www.biostatistics.oxfordjournals.org); see Remark 1). For the first case, let *p* = 3 and *X*_{3} = *a*_{1}*X*_{1} + *a*_{2}*X*_{2} + *b**X*_{1}·*X*_{2} + *ϵ*. Then

which is 0 iff *a*_{1} + *b**E*(*X*_{2}) = 0. In this case, the edge between *X*_{1} and *X*_{3} is not detected in a GGM, even though the 2 are conditionally dependent. The precise constraint on the parameters may ensure that this case rarely happens in practice (see Remark 1 in supplementary material (http://www.biostatistics) for the exact constraint on parameters in more general settings).

As a more important example, let *p* = 4 and *X*_{k} = *a*_{1k}*X*_{1} + *a*_{2k}*X*_{2} + *b*_{k}*X*_{1}·*X*_{2} + *ϵ*_{k}, where *X*_{1}, *X*_{2}, *ϵ*_{k} are pairwise independent, for *k* = 3,4. A simple calculation shows that:

which is nonzero iff both *b*_{3} and *b*_{4} are nonzero. In other words, an edge would be detected between *X*_{3} and *X*_{4} even though they are conditionally independent given *X*_{1} and *X*_{2}.

In short, the one-to-one relationship between nonzero paCors and conditional dependencies that exists in a GGM breaks down when multiplicative effects are present. Frequently, incorrect inference will involve declaring a pair of variables conditionally dependent when in fact they are not (as in our second example). The goal of our method is to identify regions in an estimated GGM where the presence of an edge might not represent direct dependence as assumed, but rather is due to an interaction.

As in the case of GGM reconstruction, our method is also based on paCor estimates. A key element is recognizing that interactions have a marginal linear effect and, as a result, information about their presence can be obtained from the paCor matrix. More precisely, if *X*_{3} = *a*_{1}*X*_{1} + *a*_{2}*X*_{2} + *b**X*_{1}·*X*_{2} + *ϵ* as in the previous section, then

(3.1)

except for very particular coefficient configurations. This motivates the first step of the method which consists of identifying nodes that are potentially affected by interactors; these are the nodes that have at least 2 nonzero paCor coefficients (one corresponding to each potential interactor).

An appended matrix is then constructed for each such node, consisting of its significant associations (detected in the first step) as well as all possible pairwise interactions. PaCors are then reestimated and used to identify true interactors. For example, given expression (3.1), *X*_{3} would be a node identified in step 1 as potentially affected by an interaction, and an appended matrix for *X*_{3} would be constructed with columns {*X*_{1},*X*_{2},*X*_{1}·*X*_{2}}. The construction of this matrix is motivated by noting that

(for a more general statement, see Claim 4 in supplementary material (http://www.biostatistics.oxfordjournals.org)). The detailed steps of the proposed approach follow for a more general setting.

For variables *X*_{1},…,*X*_{p}, identify the nonzero paCor coefficients. Several approaches are now available for estimating the paCor matrix for sparse graphs when *n* <<*p* (Schäfer and Strimmer, 2005a, 2005b; Li and Gui, 2006; Meinshausen and Bühlman, 2006; Yuan and Lin, 2007; Friedman *and others*, 2008). We rely on the method of Schäfer and Strimmer (2005a, 2005b) as it provides an estimated paCor matrix (using the optimal shrinkage estimator of Ledoit and Wolf (2003)) as well as an empirical Bayes method to identify paCors at a desired FDR. The Schäfer and Strimmer approach is implemented in the R packages GeneNet 1.2.2 and fdrtool 1.2.5, available from http://cran.r-project.org/doc/packages/.

Consider the variables that have at least 2 nonzero entries in the paCor matrix. Let *X*_{i} denote such a variable and *X*_{i1},…,*X*_{ik}(*k* ≥ 2) the corresponding variables having nonzero paCors with *X*_{i}. Form a new data matrix *M*_{i} that includes these variables and all 2-way interactors, namely,

Estimate the paCor matrix of data *M*_{i}, for each variable *X*_{i} associated with at least 2 other variables. Then, *X*_{is}·*X*_{is + t} with *s* ≤ *k* − 1,*t* ≤ *k* − *s*, is declared an interactor for *X*_{i} if the following inequality holds:

This step is justified by the following theorem:

THEOREM Let = {*Y*_{1},…,*Y*_{K}} be a set of *K* independent variables, *K* ≥ 2, with *E*(*Y*_{i}) = *μ*_{i} and Var(*Y*_{i}) = *σ*_{i}^{2}, *i*{1,…,*K*} and assume that *Z* = ∑_{i = 1}^{K}*a*_{i}*Y*_{i} + ∑_{1 ≤ i < j ≤ K}*b*_{ij}*Y*_{i}·*Y*_{j} + *ϵ*, with *ϵ*~*N*(0,*σ*_{ϵ}^{2}), independent from all *Y*_{i}s. Let = {*Y*_{ij} = *Y*_{i}·*Y*_{j}|1 ≤ *i*<*j* ≤ *K*} denote the set of all possible interactors. Then the following holds:

(3.2)

A detailed proof of this result is provided in supplementary material available at Biostatistics online (http://www.biostatistics.oxfordjournals.org). The main idea is to compute paCor(*Z*,*Y*_{i}·*Y*_{j}|,\*Y*_{i}·*Y*_{j}|), paCor(*Z*,*Y*_{i}|\*Y*_{i},) and paCor(*Z*,*Y*_{j}|\*Y*_{j},) and show that they depend directly on *b*_{ij}, *a*_{i} and *a*_{j}, respectively. This is achieved by using a recursive formula (see formula (A.1) from the supplementary material available at Biostatistics online, http://www.biostatistics.oxfordjournals.org) to explicitly compute the above paCors.

The theorem indicates that our approach can be used to identify interactors when the effect *b*_{ij} is large enough, as indicated by the inequality (3.2). We show in Section 5 that this restriction is not prohibitive in practice, as a number of interactors are identified in the case study considered.

Draw the graph. The resulting graph consists of the nodes (in white; isolated nodes are not shown) and edges identified in Step 1. Additional nodes (in gray) and edges are added to highlight the interactors identified in Step 3. Edges are not removed in any of the above steps. As a result, the graph augments the estimated GGM by identifying subgraphs where edges might be due to linear dependencies, interactions, or both.

To investigate the performance of the proposed approach, we have conducted computer simulations for networks of varying degrees of sparseness, heritability (i.e. the proportion of variability explained by the model), and sample sizes. It is important to note that most of the existing GGM methods for gene network reconstruction rely on the sparsity assumption of the paCor matrix (i.e. few nonzero entries), indicating that the variation in the expression pattern of a gene is explained by a small subset of other genes. As our method depends on estimating the paCor matrix, we too consider sparse networks for our simulation studies. This assumption is practically realistic, as sparsity has been observed in many studies (Barabási, 2004; Gardner *and others,* 2003; Jeong *and others*, 2001), including the breast cancer study presented in the Section 5.

As discussed previously, a GGM is determined by its corresponding paCor matrix; as a result, simulations of such networks involve simulating from a specified paCor matrix (Schäfer and Strimmer, 2005; Li and Gui, 2006). As our work involves interactions and thus departs from the MVN assumption, simulating from the paCor matrix is not sufficient. Instead, data sets were generated as follows:

- Generate N independent Gaussians
*X*_{1}~*N*(*μ*_{1},*σ*_{1}^{2}),…,*X*_{N}~*N*(*μ*_{N},*σ*_{N}^{2}). - Generate M nodes (
*M*<*N*) according to the models:with*p*_{1}+*q*_{1}+*p*_{2}+*q*_{2}=*M*and*k*_{i}<*N*− 2,*i*= 1,…,*M*. Variables*X*_{1},…,*X*_{N},*Y*_{1},…,*Y*_{M}represent the network nodes. We consider 2 scenarios

- I.
*N*= 42,*M*= 8,*p*_{1}= 3,*q*_{1}= 1,*p*_{2}= 4,*q*_{2}= 0, - II.
*N*= 80,*M*= 20,*p*_{1}= 8,*q*_{1}= 2,*p*_{2}= 8,*q*_{2}= 2.

For each scenario, we set *σ*_{ϵ} = 1, *σ*′_{i}*s* = 1, *i* = 1,…,*N*, and choose appropriate values for the remaining parameters such that there is a balanced proportion of variables with heritability values covering the interval [0.4,0.8], similar to what was observed in the case study network. The level of sparseness was 2% for scenario I and 1% for scenario II. Moreover, the coefficients of the interaction terms are chosen to satisfy inequality (3.2) in Section 3, based upon which interactions are declared.

For simulation study I (50 nodes), sample sizes ranging from 30 to 110 were considered, while for simulation study II (100 nodes), sample sizes ranging from 50 to 150 were considered, both in 20-unit increments. For each sample size and simulation scenario, 2000 data sets were generated. The proposed approach was applied to each simulated data set using an FDR threshold of 10% in step 1 of the method, when selecting variables with at least 2 associations. Figure 1 shows the average power and FDR over 2000 simulations, for main effects and interactors and the sample sizes considered. Average specificity was greater than 0.95 in all cases, and is therefore not shown.

Average power to detect main effects (solid line) and interactions (dashed line) for scenarios I and II are shown in the upper and lower left panels, respectively. Average FDR over main effects (solid line) and interactions (dashed line) for scenarios **...**

The method detects additive effects with good sensitivity even for small sample sizes: for scenario I, the average power is .4 when the sample size is only 40; for scenario II, when the sample size is 70, the average power is approximately 0.5. The power increases to 80% for both scenarios for sample sizes larger than 110. Interactors are also detected with reasonable sensitivity when *n*<*p*, exceeding 36% and 50% for sample sizes of 70 and 90, respectively, in simulation study II. In both simulation scenarios, sensitivity increases with the sample size as expected.

The false discovery proportions for additive effects as well as interactions are well controlled, being less than the nominal level of 0.1 in both simulation settings and for all sample sizes. In summary, these results support 2 important points: (i) if the joint distribution departs from MVN due to the presence of interactors, the GGM estimation technique considered is still functioning well in detecting the true linear effects with good sensitivity and specificity, and (ii) our approach performs well in detecting interactors with sufficient signal, while controlling the FDR.

We applied our approach to microarray data from the Duke Breast Cancer SPORE frozen tissue bank (West *and others*, 2001), available at http://data.cgt.duke.edu/WEST/PNASCel.zip. The data consist of 49 tissue samples and 7129 expression measurements, recorded using Affymetrix hu6800 chips. Preprocessing was done using robust multiarray average (RMA) (Irizarry *and others*, 2003) to obtain a single, normalized score of expression for each probe set on each array. When multiple probe sets were used for the same gene, we selected the probe set measurement with the highest RMA average across the 49 samples.

Since it is difficult to visualize, interpret, and understand a network constructed from over 7000 nodes, we chose to focus on networks constructed with genes involved in similar biological pathways. We used library hu6800 to determine Gene Ontology (GO) annotations (http://www.geneontology.org) for our data set. GO refers to three structured vocabularies, or ontologies, that describe gene products in terms of their associated biological processes (BP), cellular components (CC) and molecular functions (MF) in a species-independent manner. In GO, an annotated gene is assigned to one or more GO terms, each of which describes the BP, CC, or MF associated with the gene. We clustered the breast cancer genes into 7810 GO terms, of which 686 contained at least 50 genes. The 182 GO terms containing between 100 and 200 genes were the focus of our investigation.

Among them, the GO term GO:0008284 (positive regulation of cell proliferation) contains 170 genes. Our method applied to this gene set gives the network shown in Figure 2, consisting of 77 edges of which 12 represent interacting genes. To control the proportion of possibly spurious findings, the FDR threshold was set at 0.1, as in the simulation studies. By accounting for interactors, the reconstructed network provides additional insight into the nature of the direct associations (additive or multiplicative effects) detected.

Network reconstructed from 170 genes with GO annotation “positive regulation of cell proliferation,” GO:0008284. A total of 77 edges were obtained. Unconnected nodes are not shown. White nodes represent linear effects; gray nodes represent **...**

For example, our approach revealed that gene CDK4 (cyclin-dependent kinase 4; upper right corner) is associated with ERBB2 (HER2/neu, human epidermal growth factor receptor 2), FGG (fibrinogen gamma chain), FGB (fibrinogen beta chain) and CAPN1 (calpain 1), as well as the three other interactions ERBB2·FGG, ERBB2·FGB, FGB·FGG. To further investigate this dependence, we fitted linear regression models with and without the interaction terms. Each interactor was found significant (*p*-value<0.05) by the respective linear models. Moreover, the model with interaction terms explains 76% of the variability for CDK4, while the additive model accounts for only 53%. The interactive model is favored also by the Bayesian information criterion (BIC) (Table 1) and the presence of interactors is further reinforced in panels (a)–(c) of Figure 3, which show a nonrandom pattern when the residuals of the additive model for CDK4 are plotted against each interactor.

Residuals of the additive models for CDK4 (panels (a–c), SCG2 (panel d), versus their corresponding interactors.

As a second example, consider the subgraph around SCG2 (secretogranin II), a gene that encodes an important neuroendocrine secretory protein. Our results suggest that SCG2 depends directly on CAPN1, PDGFA, IGF1R, and the interactor PDGFA·IGF1R. As above, the interactor is highly significant, with an estimated coefficient of 2.05 (SE = 0.42; *p*-value<0.0001) and is supported by the nonrandom pattern of residuals (Figure 3, panel d). Moreover, this interaction helps explain a substantial proportion of the variability for SCG2: CAPN1, PDGFA, and IGF1R account for 28.1% of the genetic variability, increasing to 53.4% when the interaction term PDGFA·IGF1R is included. Finally, the BIC is considerably lower for the interactive model (Table 1).

The percentage of variance and BIC explained by / determined by the additive models (AM) versus the interactive effects models (IM).

Similar results were uncovered for most interactors identified. Overall, 8 of the 12 interactors identified in the full network were found significant based on local linear regression models (*p*-value<0.05), and heritabilities were significantly increased when accounting for interactors, as reflected by Table 1. Moreover, the interaction models resulted in comparably lower BICs. Networks built within several other GO terms from the 182 annotation list exhibit similar properties as the network described above.

Properly reconstructed biological networks have the potential to elucidate many important questions in biology and, with recent technological advances, the measurements required for reconstruction are now largely available. As a result, network reconstruction efforts abound. The GGM framework is often used since computation and interpretation are straightforward. However, GGMs do not allow for interacting nodes due to the underlying MVN assumption. As we show here for 2 simple cases, proceeding with GGM estimation and interpretation in the presence of interactions can result in both false-positive and false-negative identifications of conditional dependence.

Identifying the underlying graph structure in the presence of interactors is a challenging problem. We have considered the specific case where nodes are conditionally normal, but the joint is not MVN due to the presence of interactions. This is a difficult problem as an analogue of the paCor matrix that uniquely determines linear and higher order dependencies does not exist. Deriving such a measure for this family of densities is a challenging open question. In the meantime, the proposed approach should prove useful in identifying interactions and constructing informative graphs that provide good insight into the nature of dependencies among genes. Specifically, the graph provided by our approach augments an estimated GGM by identifying subgraphs where edges might be due to linear dependencies, interactions, or both. A careful consideration of such subgraphs can provide insight into the nature of the dependencies.

In practice, the graph identified depends on a number of factors, and it is worthwhile to investigate how moderate variations impact the results. An obvious value to consider is the FDR threshold, which in part determines the number of candidate interactors detected in the first step of the approach. We considered an FDR threshold of 0.1 in the simulations and case study. However, it could certainly be the case that for some data sets, a threshold of 0.1 will yield an excess of edges and interactors that are difficult to interpret. Alternatively, there could be data sets for which a slightly higher FDR cutoff is required. We recommend varying the FDR threshold and checking that edges calculated at the more conservative threshold are largely preserved when the threshold is increased. That was the case for the Duke study presented here (results not shown).

Another factor to consider is whether or not variables are scaled. It is well known that predictors that are additive on one scale may exhibit an interaction when a different scale is used, and conversely (Frankel and Schork, 1996; Greenland and Rothman, 1998; Cordell, 2002). In the context of our method, some of the coefficients important for identification of interactions (step 3 in Methods) are not invariant to scaling and, as a result, the interactors identified on one scale might differ from those obtained on another. We did not scale the variables here, noting as in Frankel and Schork (1996) that doing so may facilitate the identification of interactors.

The number of edges and interactions detected will also depend on the number of nodes initially considered and on the method used for node selection. Our approach relies on estimation of a paCor matrix, and since good approaches now exist for estimation in the case of *n*<*p*, the number of nodes considered is no longer restricted by the sample size. Nevertheless, a choice needs to be made, and the method used for feature selection will impact the number of interactors identified.

Our case studies considered nodes defined from GO categories; and in most cases, interactions among some subset of the nodes were found. Results such as these have important implications for biologists. Certainly membership of a collection of genes in the same GO category provides some evidence that the genes are somehow related. Ideally, a biologist would like to know which ones are directly related, and how. An experimental consideration of all possible associations is not feasible, and as a result typically the experiments done to investigate a collection of genes focus on genes in isolation or pairs. Our approach identifies which genes in a collection are directly dependent, and therefore should be studied together. The approach also highlights genes where the dependence might involve an interaction. For example, our breast cancer case study of GO:0008284 found CDK4 to be directly associated with ERBB2, FGG, FGB, and CAPN1. There is much information about many of these genes in isolation: CDK4 overexpression is associated with high tumor cell proliferation in breast cancer (An *and others*, 1999); ERBB2 (HER2/neu) is well known for its role in the pathogenesis of breast cancer and as a target of treatment (Lane *and others*, 2000; Ross *and others*, 2003; Menard *and others*, 2005); and CAPN1 and other CAPN (Calpains) genes have been tied to different human diseases (Huang and Wang, 2001). There are also studies that consider pairs of these genes: Nahta *and others* (2002), show that CDK4 activity is specifically important for ERBB2-mediated transformation and reciprocally regulates ERBB2 levels. Our results suggest that studies of CDK4 should involve at least ERBB2, FGG, FGB, and CAPN1; and that the relationships measured should account for the potential interactions of ERBB2 with FGG, ERBB2 with FGB, and FGG with FGB. Indeed, a detailed study of this group might consider the 3-way interaction suggested by the 3 pairwise interactions we found. Identifying 3-way interactions was beyond the scope of our study.

In summary, we have presented an approach to identify statistical interactions in networks. The main advantages compared with other network reconstruction approaches that allow for nonlinear relationships among nodes (Friedman *and others,* 2000) are that the approach makes relatively few, but testable, assumptions, does not rely on prior specification, and is computationally efficient, requiring little increase in computation over traditional GGM fitting. Extensions to more general networks are underway and should prove useful particularly for network reconstructions involving continuous phenotype and discrete genotype data.

National Institute of General Medical Sciences (R01GM076274) to C.K.

Supplementary Material is available at http://www.biostatistics.oxfordjournals.org.

We wish to thank Michael Newton for comments that greatly broadened the presentation of this work. We also thank 2 referees for their helpful suggestions. *Conflict of Interest*: None declared.

- An H, Beckmann MW, Reifenberger G, Bender H, Niederacher D. Gene amplification and overexpression of CDK4 in sporadic breast carcinomas is associated with high tumor cell proliferation. American Journal of Pathology. 1999;154:113–118. [PubMed]
- Barabási AL. Network biology: understanding the cell's functional organization. Nature Reviews Genetics. 2004;5:101–113. [PubMed]
- Brem RB, Kruglyak L. The landscape of genetic complexity across 5,700 gene expression traits in yeast. Proceedings of the National Academy of Sciences of the United States of America. 2005;102:1572–1577. [PubMed]
- Butte AS, Kohane IS. Relevance networks: a first step toward finding genetic regulatory networks within microarray data. In: Parmigiani G, Garett ES, Irizarry RA, Zeger SL, editors. The Analysis of Gene Expression Data. New York, NY: Springer; 2003. pp. 428–446.
- Carlborg O, Kerje S, Schtz K, Jacobsson L, Jensen P, Andersson L. A global search reveals epistatic interaction between QTL for early growth in the chicken. Genome Research. 2003;13:413–421. [PubMed]
- Cordell HJ. Epistasis: what it means, what it doesn't mean, and statistical methods to detect it in humans. Human Molecular Genetics. 2002;11:2463–2468. [PubMed]
- Cox DR, Wermuth N. Tests of linearity, multivariate normality and the adequacy of linear scores. Applied Statistics. 1994;43:347–355.
- Fisk PR. A note on a characterization of the multivariate normal distribution. Annals of Statistics. 1970;41:486–494.
- Frankel WN, Schork NJ. Who's afraid of epistasis? Nature Genetics. 1996;14:371–373. [PubMed]
- Friedman J, Hastie T, Tibshirani R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics. 2008;9:432–441. [PMC free article] [PubMed]
- Friedman N, Linial M, Nachman I, Pe'er D. Using Bayesian networks to analyze gene expression data. Journal of Computational Biology. 2000;7:601–620. [PubMed]
- Gardner TS, di Bernardo D, Lorenz D, Collins J. Inferring genetic networks and identifying compound mode of action via expression profiling. Science. 2003;301:102–105. [PubMed]
- Greenland S, Rothman KJ, editors. Modern Epidemiology. 2nd edition. Philadelphia, PA: Lippincott-Raven Publishers; 1998. Concepts of interaction; pp. 329–342.
- Hanlon P, Lorenz WA, Shao Z, Harper J, Galecki AT, Miller R, Burke DT. Three-locus and four-locus QTL interactions influence mouse insulin-like growth factor-I. Physiological Genomics. 2006;26:46–54. [PMC free article] [PubMed]
- Hartemink A. Reverse engineering gene regulatory networks. Nature Biotechnology. 2005;23:554–555. [PubMed]
- Huang Y, Wang KW. The calpain family and human disease. Trends in Molecular Medicine. 2001;7:355–362. [PubMed]
- Irizarry RA, Bolstad BM, Collin F, Cope LM, Hobbs B, Speed TP. Summaries of Affymetrix GeneChip probe level data. Nucleic Acid Research. 2003;31:e15. [PMC free article] [PubMed]
- Jeong H, Mason SP, Barabási AL, Oltvai ZN. Lethality and centrality in protein networks. Nature. 2001;411:41–42. [PubMed]
- Keller MP, Choi YJ, Wang P, Davis DB, Rabaglia ME, Oler AT, Stapleton DS, Argmann C, Schueler K, Edwards S. and others. A gene expression network model of type 2 diabetes links cell cycle regulation in islets with diabetes susceptibility. Genome Research. 2008;18:706–716. [PubMed]
- Kirst M, Basten CJ, Myburg AA, Zeng ZZ, Sederoff RR. Genetic architecture of transcript level variation in differentiating xylem of an eucalyptus hybrid. Genetics. 2005;169:2295–2303. [PubMed]
- Lane H, Beuvink I, Motoyama AB, Daly JD, Neve RM, Hynes NE. ErbB2 Potentiates breast tumor proliferation through modulation of p27Kip1-Cdk2 complex formation: receptor overexpression Ddoes not determine growth dependency. Molecular and Cellular Biology. 2000;20:3210–3223. [PMC free article] [PubMed]
- Lauritzen S. Graphical Models. Oxford: Oxford University Press; 1996.
- Ledoit O, Wolf M. A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis. 2003;88:365–411.
- Li H, Gui J. Gradient directed regularization for sparse Gaussian concentration graphs, with applications to inference of genetic networks. Biostatistics. 2006;2:302–317. [PubMed]
- Ma S, Gong Q, Bohnert HJ. An Arabidopsis gene network based on the graphical Gaussian model. Genome Research. 2007;17:1614–1625. [PubMed]
- Markowetz F, Spang R. Inferring cellular networks—a review. BMC Bioinformatics. 2007;8(Suppl. 6):S5. [PMC free article] [PubMed]
- Matsuno T, Tominaga N, Arizono K, Iguchi T, Kohara Y. Graphical gaussian modeling for gene association structures based on expression deviation patterns induced by various chemical stimuli. IEICE Transactions of Information and Systems. 2006;E89-D:1563–1574.
- Meinshausen N, Bühlman P. High dimensional graphs and variable selection with the Lasso. Annals of Statistics. 2006;34:1436–1462.
- Menard S, Casalini P, Campiglio M, Pupa SM, Tagliabue E. Role of HER2/neu in tumor progression and therapy. Cellular and Molecular Life Sciences. 2005;61:2965–2978. [PubMed]
- Nahta R, Iglehart JD, Kempkes B, Schmidt EV. Rate-limiting effects of cyclin D1 in transformation by ErbB2 predicts synergy between herceptin and flavopiridol. Cancer Research. 2002;62:2267–2271. [PubMed]
- Ross JS, Fletcher JA, Linette GP, Stec J, Clark E, Ayers M, Symmans WF, Pusztai L, Bloom KJ. The Her-2/neu gene and protein in breast cancer 2003: biomarker and target of therapy. Oncologist. 2003;8:307–325. [PubMed]
- Schäfer J, Strimmer K. An empirical Bayes approach to inferring large-scale gene association networks. Bioinformatics. 2005a;21:754–764. [PubMed]
- Schäfer J, Strimmer K. A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Statistical Applications in Genetics and Molecular Biology. 2005b;4:1–30. [PubMed]
- Shiozawa M, Provoost AP, Van Dokkum RP, Majewski RR, Jacob HJ. Evidence of gene-gene interactions in the genetic susceptibility to renal impairment after unilateral nephrectomy. Journal of the American Society of Nephrology. 2000;11:2068–2078. [PubMed]
- Stuart A, Ord JK, Arnold S. Kendall's Advanced Theory of Statistics. 6th edition. Volume 2A. New York, NY: Oxford University Press, Inc; 1999.
- Toh H, Horimoto K. Inference of a genetic network by a combined approach of cluster analysis and graphical Gaussian modeling. Bioinformatics. 2002;18:287–297. [PubMed]
- Wentzell MA, Rowe HC, Hansen BG, Ticconi C, Halkier BA, Kliebenstein DJ. Linking metabolic QTL with network and cis-eQTL controlling biosynthetic pathways. PLoS Genetics. 2007;162 [PubMed]
- Werhli AV, Grzegorczyk M, Husmeier D. Comparative evaluation of reverse engineering gene regulatory networks with relevance networks, graphical gaussian models and bayesian networks. Bioinformatics. 2006;22:2523–2531. [PubMed]
- West M, Blanchette C, Dressman H, Huang E, Ishida S, Spang R, Zuzan H, Olson JA, Marks JR, Nevins JR. Predicting the clinical status of human breast cancer by using gene expression profiles. Proceedings of the National Academy of Sciences of the United States of America. 2001;98:11462–114677. [PubMed]
- West MA, Kim K, Kliebenstein DJ, van Leeuwen H, Michelmore RW, Doerge RW, St. Clair DA. Global eQTL mapping reveals the complex genetic architecture of transcript-level variation in Arabidopsis. Genetics. 2007;175:1441–1450. [PubMed]
- Whittaker J. Graphical Models in Applied Multivariate Statistics. New York: Wiley; 1990.
- Wille A, Zimmermann P, Vranova E, Farholz A, Laule O, Bleuler S, Hennig L, Preli A, von Rohr P, Thiele L. and others. Sparse graphical Gaussian modeling of the isoprenoid gene network in Arabidopsis thaliana. Genome Biology. 2004;5 [PMC free article] [PubMed]
- Wittenburg H, Lammert F, Wang D, Churchill GA, Li R, Bouchard G, Carey MC, Paigen B. Interacting QTLs for cholesterol gallstones and gallbladder mucin in AKR and SWR strains of mice. Physiological Genomics. 2001;8:67–77. [PubMed]
- Yuan M, Lin Y. Model selection and estimation in the Gaussian graphical model. Biometrika. 2007;94:19–35.

Articles from Biostatistics (Oxford, England) are provided here courtesy of **Oxford University Press**

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's national science library in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |