PLoS Comput Biol. 2008 August; 4(8): e1000117.

Published online 2008 August 15. doi: 10.1371/journal.pcbi.1000117

PMCID: PMC2446438

Satoru Miyano, Editor

Department of Human Genetics, David Geffen School of Medicine, and Department of Biostatistics, School of Public Health, University of California, Los Angeles, California, United States of America

University of Tokyo, Japan

* E-mail: shorvath@mednet.ucla.edu

Conceived and designed the experiments: SH. Analyzed the data: JD. Wrote the paper: SH JD.

Received 2007 October 12; Accepted 2008 June 9.

Copyright Horvath, Dong.

This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.

The merging of network theory and microarray data analysis techniques has spawned a new field: gene coexpression network analysis. While network methods are increasingly used in biology, the network vocabulary of computational biologists tends to be far more limited than that of, say, social network theorists. Here we review and propose several potentially useful network concepts. We take advantage of the relationship between network theory and the field of microarray data analysis to clarify the meaning of and the relationship among network concepts in gene coexpression networks. Network theory offers a wealth of intuitive concepts for describing the pairwise relationships among genes, which are depicted in cluster trees and heat maps. Conversely, microarray data analysis techniques (singular value decomposition, tests of differential expression) can also be used to address difficult problems in network theory. We describe conditions when a close relationship exists between network analysis and microarray data analysis techniques, and provide a rough dictionary for translating between the two fields. Using the angular interpretation of correlations, we provide a geometric interpretation of network theoretic concepts and derive unexpected relationships among them. We use the singular value decomposition of module expression data to characterize approximately factorizable gene coexpression networks, i.e., adjacency matrices that factor into node specific contributions. High and low level views of coexpression networks allow us to study the relationships among modules and among module genes, respectively. We characterize coexpression networks where hub genes are significant with respect to a microarray sample trait and show that the network concept of intramodular connectivity can be interpreted as a fuzzy measure of module membership. We illustrate our results using human, mouse, and yeast microarray gene expression data. 
The unification of coexpression network methods with traditional data mining methods can inform the application and development of systems biologic methods.

Similar to natural languages, network language is ever evolving. While some network terms (concepts) are widely used in gene coexpression network analysis, others still need to be developed to meet the ever increasing demand for describing the system of gene transcripts. There is a need to provide an intuitive geometric explanation of network concepts and to study their relationships. For example, we show that certain seemingly disparate network concepts turn out to be synonyms in the context of coexpression modules. We show how coexpression network language affects our understanding of biology. For example, there are geometric reasons why highly connected hub genes in important coexpression modules tend to be important, and why hub genes in one module cannot be hubs in another distinct module. We provide a short dictionary for translating between microarray data analysis language and network theory language to facilitate communication between the two fields. We describe several examples that illustrate how the two data analysis fields can inform each other.

Many biological networks share topological properties. Common global properties include modular organization [1],[2], the presence of highly connected hub nodes, and approximate ‘scale free topology’ [3],[4]. Common local topological properties include the presence of recurring patterns of interconnections (‘network motifs’) in regulation networks [5]–[7].

One goal of this article is to describe existing and novel network concepts (also known as network statistics or indices [8]) that can be used to describe local and global network properties. For example, the clustering coefficient [9] is a network concept that measures the cohesiveness of the neighborhood of a node. We are particularly interested in network concepts that are defined with regard to a ‘gene significance measure’. Gene significance measures are of great practical importance since they allow one to incorporate external gene information into the network analysis. In functional enrichment analysis, a gene significance measure could indicate pathway membership. In gene knock-out experiments, gene significance could indicate knock-out essentiality. We study gene significance measures since a microarray sample trait (e.g., case control status) gives rise to a statistical measure of gene significance. For example, the Student *t*-test of differential expression leads to a gene significance measure. Many traditional microarray data analysis methods focus on the relationship between the microarray sample trait and the gene expression data. For example, gene filtering methods aim to find a list of (differentially expressed) genes that are significantly associated with the microarray sample trait, while microarray-based prediction methods aim to accurately predict the sample trait on the basis of the gene expression data.

Gene expression profiles across microarray samples can be highly correlated and it is natural to describe their pairwise relations using network language. Genes with similar expression patterns may form complexes, pathways, or participate in regulatory and signaling circuits [10]–[12]. Gene coexpression networks have been used to describe the transcriptome in many organisms, e.g., yeast, flies, worms, plants, mice, and humans [13]–[23]. Gene coexpression network methods have also been used for typical microarray data analysis tasks such as gene filtering [19], [24]–[26] and outcome prediction [27],[28].

While the utility of network methods for analyzing microarray data has been demonstrated in numerous publications, the utility of microarray data analysis techniques for solving network theoretic problems has not yet been fully appreciated. One goal of this article is to show that simple geometric arguments can be used to derive network theoretic results if the networks are defined on the basis of a correlation matrix.

Although many of our network concepts will be useful for general networks, we are particularly interested in gene coexpression networks (also known as association-, influence-, relevance-, or correlation networks). Gene coexpression networks are built on the basis of a gene coexpression measure. The network nodes correspond to genes, or more precisely to gene expression profiles. The *i*th gene expression profile *x _{i}* is a vector whose components report the gene expression values across *m* microarrays. The coexpression similarity between genes *i* and *j* is the absolute value of the correlation between their expression profiles, |cor(*x _{i}*,*x _{j}*)|.

Using a thresholding procedure, this coexpression similarity is transformed into a measure of connection strength (adjacency). An unweighted network adjacency *a _{ij}* between gene expression profiles *x _{i}* and *x _{j}* can be defined by hard thresholding the coexpression similarity:

$$a_{ij} = \begin{cases} 1 & \text{if } |\mathrm{cor}(x_i, x_j)| \ge \tau \\ 0 & \text{otherwise} \end{cases} \tag{1}$$

where τ is the “hard” threshold parameter. Thus, two genes are linked (*a _{ij}*=1) if the absolute correlation between their expression profiles reaches or exceeds the (hard) threshold τ. Hard thresholding of the correlation leads to simple network concepts (e.g., the gene connectivity equals the number of direct neighbors) but it may lead to a loss of information: if τ has been set to 0.8, there will be no link between two genes if their correlation equals 0.799. To preserve the continuous nature of the coexpression information, one could simply define a weighted adjacency matrix as the absolute value of the gene expression correlation matrix, i.e., [*a _{ij}*]=[|cor(*x _{i}*,*x _{j}*)|]. More generally, the adjacency can be defined as a power of the absolute correlation:

$$a_{ij} = |\mathrm{cor}(x_i, x_j)|^{\beta} \tag{2}$$

with *β*≥1. This soft thresholding approach leads to a weighted gene coexpression network. We present empirical results for weighted and unweighted networks in the main text, Text S1, Text S2, and Text S3.
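The two thresholding schemes can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names `pearson` and `adjacency` are hypothetical, the hard threshold is applied as "absolute correlation at least τ", and the diagonal is set to 1 by convention (it does not enter the connectivity, which sums over *j*≠*i*).

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def adjacency(profiles, beta=None, tau=None):
    """Adjacency matrix for a list of gene expression profiles.

    tau given  -> unweighted network (hard threshold on |cor|)
    beta given -> weighted network (soft threshold, |cor| ** beta)
    """
    if (beta is None) == (tau is None):
        raise ValueError("specify exactly one of beta or tau")
    n = len(profiles)
    # Diagonal set to 1 by convention (an assumption of this sketch).
    adj = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            s = abs(pearson(profiles[i], profiles[j]))
            if tau is not None:
                a = 1.0 if s >= tau else 0.0
            else:
                a = s ** beta
            adj[i][j] = adj[j][i] = a  # the network is undirected
    return adj
```

Because the absolute value is taken before thresholding, a perfectly anti-correlated pair of profiles receives the same adjacency as a perfectly correlated pair.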

Since humans are organized into social networks, social network analogies should be intuitive to many readers. Therefore, we will refer to the following ‘affection network’ throughout this article. Assume that *n* individuals filled out an interest questionnaire, which was used to define a pairwise similarity score *s _{ij}*. For convenience, we assume that the similarity measure takes on values between 0 and 1. Our definition of the affection network is based on the following assumption: the more similar the interests between two individuals, the more affection they feel for each other. More specifically, we assume that the affection (adjacency) *a _{ij}* is a power of the similarity:

$$a_{ij} = s_{ij}^{\beta} \tag{3}$$

This is equivalent to our soft thresholding approach *a _{ij}*=|cor(*x _{i}*,*x _{j}*)|^{β} (Equation 2), with the similarity *s _{ij}* playing the role of the absolute correlation.

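Many network applications use at least one gene significance measure. Abstractly speaking, we define a gene significance measure as a function *GS* that assigns a nonnegative number to each gene; the higher *GS _{i}* the more biologically significant is gene *i*. A microarray sample trait *T* gives rise to a trait-based gene significance measure:

$$GS_i = |\mathrm{cor}(x_i, T)|^{\beta} \tag{4}$$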

Although any power *β* could be used in Equation 4, we use the same power as in Equation 2 to facilitate a simple geometric interpretation.

We find it convenient to express network quantities in terms of correlation coefficients since the correlation between two vectors can be interpreted as the cosine of the angle between them (measured in radians) if the vectors are scaled to have a mean of 0. Since the correlation is scale-invariant, i.e., cor(*ax _{i}*+

The network adjacency *a _{ij}* is a monotonically decreasing function of the angle

Since the trait-based gene significance measure *GS _{i}*=|cor(

As a motivational example, we study the pairwise correlations among 498 genes that had previously been found to form a sub-network related to mouse body weight. The microarray data measure the expression levels in multiple tissue samples (liver, adipose, brain, muscle) from male and female mice of an F2 intercross. Approximately 100 tissue samples are available for each gender/tissue combination. The biological significance of this subnetwork is described in [23],[26]. Here we focus on the mathematical and topological properties of the pairwise absolute correlations *a _{ij}*=|cor(

It is visually obvious that the heat maps and the cluster trees of different gender/tissue combinations can look quite different. Network theory offers a wealth of intuitive concepts for describing the pairwise relationships among genes that are depicted in cluster trees and heat maps. To illustrate this point, we describe several such concepts in the following. By visual inspection of Figure 1B, genes appear to be more highly correlated in liver than in adipose (a lot of red versus green color in the corresponding heat maps). This property can be captured by the concept of network density (defined below). The density of the female liver network is 0.39 while it is only 0.23 for the female adipose network. Another example for the use of network concepts is to quantify the extent of cluster (module) structure. In this example, branches of a cluster tree (Figure 1A) correspond to modules in the corresponding network. The cluster structure is also reflected in the corresponding heat maps: modules correspond to large red squares along the diagonal. Network theory provides a concept for quantifying the extent of module structure in a network: the mean clustering coefficient (defined below). The female liver, male liver and female brain networks have high mean clustering coefficients (mean *ClusterCoef*=0.42, 0.43, 0.41, respectively). In contrast, the female adipose, male adipose, and male brain networks have lower mean clustering coefficients (mean *ClusterCoef*=0.27, 0.27, 0.25, respectively). Difference in module structure may reflect true biological differences or they may reflect noise (e.g. technical artifacts or tissue contaminations).

As another example for the use of network concepts, compare the cluster tree of the female brain network with that of the male brain network. The cluster tree of the female network appears to be comprised of a single large branch, i.e., a highly connected hub gene at the tip of the branch forms the center in this network. In contrast, the cluster tree corresponding to the male brain network appears to split into multiple smaller branches, i.e., no single gene forms the center. To measure whether a highly connected hub gene forms the center in a network, one can use the concept of centralization (defined below). The female brain and male brain networks have centralization 0.34 and 0.21, respectively.

These examples illustrate that graph theory contains a wealth of network concepts that can be used to describe microarray data. But we will argue that microarray data analysis techniques can also be used to derive network theoretic results. For example, network theorists have long studied the relationship between gene significance and connectivity. Several network articles have pointed out that highly connected hub nodes are central to the network architecture [17], [29]–[32] but hub genes may not always be biologically significant [33]. To define a sample trait based gene significance measure (Equation 4), we define the gene significance of gene *i* as the absolute correlation between the gene expression profile *x _{i}* and body weight

We define network concepts for (weighted) undirected networks that can be represented by a symmetric adjacency matrix *A*=[*a _{ij}*], where 1≤

The *connectivity* (also known as degree) of the *i*th gene is defined by

$$k_i = \sum_{j \ne i} a_{ij} \tag{5}$$

In unweighted networks, the connectivity *k _{i}* equals the number of genes that are directly linked to gene *i*.

The *maximum connectivity* is defined as

$$k_{max} = \max_j (k_j) \tag{6}$$

The *scaled connectivity K _{i}* of the *i*th gene is defined by

$$K_i = \frac{k_i}{k_{max}} \tag{7}$$

By definition, 0≤*K _{i}*≤1. Note that we distinguish the scaled from the unscaled connectivity by using an upper case “*K*”.

*Social Network Interpretation of the Connectivity:* For the aforementioned affection network (Equation 3), assume that the affection (adjacency) *a _{ij}* equals 1 if two individuals strongly like each other; it equals 0.5 if they are neutral towards each other, and it equals 0 if they strongly dislike each other. Then the scaled connectivity

*Potential Uses of the Connectivity:* The connectivity is the most widely used concept for distinguishing the nodes of a network. As described in the motivational example and detailed below, intramodular connectivity can be used to define a systems biologic gene screening strategy that keeps track of module membership information [24].
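As a minimal sketch of the connectivity concepts, the following plain-Python functions compute the connectivity as the row sum of off-diagonal adjacencies and the scaled connectivity as the connectivity divided by its maximum. The function names are hypothetical, and the adjacency matrix is assumed to be a symmetric list of lists.

```python
def connectivity(adj):
    """k_i: sum of the adjacencies between node i and all other nodes."""
    return [sum(a for j, a in enumerate(row) if j != i)
            for i, row in enumerate(adj)]

def scaled_connectivity(adj):
    """K_i = k_i / k_max, so that 0 <= K_i <= 1."""
    k = connectivity(adj)
    k_max = max(k)
    if k_max == 0:
        raise ValueError("network has no connections")
    return [ki / k_max for ki in k]
```

The diagonal entries are skipped explicitly, so the result does not depend on whether the diagonal of the adjacency matrix is stored as 0 or 1.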

For weighted networks, we define the *maximum adjacency ratio* of gene *i* as follows

$$MAR_i = \frac{\sum_{j \ne i} a_{ij}^2}{\sum_{j \ne i} a_{ij}} \tag{8}$$

which is defined if *k _{i}*=Σ_{j≠i} *a _{ij}*>0.

*Social Network Interpretation of the Maximum Adjacency Ratio: MAR _{i}*=1 suggests that the

*Potential Uses of the Maximum Adjacency Ratio:* Since *MAR _{i}*=1 for all genes in an unweighted network, the maximum adjacency ratio is only useful for weighted networks. The

In weighted coexpression networks, we find empirically that *MAR _{i}* is often highly correlated with the connectivity

The *network density* (also known as line density [35]) is defined as the mean off-diagonal adjacency and is closely related to the mean connectivity:

$$Density = \frac{\sum_i \sum_{j \ne i} a_{ij}}{n(n-1)} = \frac{mean(k)}{n-1} \tag{9}$$

where *k*=(*k* _{1},…,*k _{n}*) denotes the vector of connectivities and mean(*k*) denotes the mean of its components.

*Social Network Interpretation of the Density:* The density measures the overall affection among individuals. A density close to 1 indicates that all individuals strongly like each other while a density of 0.5 suggests the presence of more ambiguous relationships.

*Potential Uses of the Density:* The density of genes in a subnetwork (e.g., a pathway) can be used to measure whether this sub-network is tight or cohesive. In our motivational mouse tissue example, we find that a network of genes has high density in liver tissue but low density in adipose tissue. The goal of many module detection methods is to find clusters of genes with high density.

The *network centralization* (also known as degree centralization [36]) is given by

$$Centralization = \frac{n}{n-2}\left(\frac{\max(k)}{n-1} - Density\right) \tag{10}$$

The centralization is 1 for a network with star topology; by contrast, it is 0 for a network where each node has the same connectivity. A regular grid network such as a square has centralization 0.

*Social Network Interpretation of the Centralization:* The centralization of the affection network is close to 1, if one individual has loving relationships with all others who in turn strongly dislike each other. In contrast, a centralization of 0 indicates that all individuals are equally popular.

*Potential Uses of the Centralization:* While the centralization is a widely used measure in social network studies, it has only rarely been used to describe structural differences of metabolic networks [37]. As described in our motivational example, the centralization can be used to describe properties of cluster trees, see also [8].
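The density and the centralization can be checked numerically on the two reference topologies mentioned above. The sketch below assumes the density is the mean off-diagonal adjacency and uses the degree-centralization formula *n*/(*n*−2)·(max(*k*)/(*n*−1) − *Density*); the function names are hypothetical.

```python
def connectivity(adj):
    """k_i: sum of the adjacencies between node i and all other nodes."""
    return [sum(a for j, a in enumerate(row) if j != i)
            for i, row in enumerate(adj)]

def density(adj):
    """Mean off-diagonal adjacency, i.e. mean(k) / (n - 1)."""
    n = len(adj)
    return sum(connectivity(adj)) / (n * (n - 1))

def centralization(adj):
    """Degree centralization: n/(n-2) * (max(k)/(n-1) - density)."""
    n = len(adj)
    if n < 3:
        raise ValueError("centralization needs at least 3 nodes")
    k = connectivity(adj)
    return n / (n - 2) * (max(k) / (n - 1) - density(adj))

# Star topology: node 0 is linked to every other node, and to nothing else.
n = 5
star = [[1.0 if (i == 0) != (j == 0) else 0.0 for j in range(n)]
        for i in range(n)]

# Square: every node has exactly two neighbors.
square = [[0, 1, 0, 1],
          [1, 0, 1, 0],
          [0, 1, 0, 1],
          [1, 0, 1, 0]]
```

For the star the centralization evaluates to 1 (up to floating point rounding), and for the square it evaluates to 0, in line with the statements in the text.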

The *network heterogeneity* measure is based on the variance of the connectivity. Authors differ on how to scale the variance [35]. We define it as the coefficient of variation of the connectivity distribution, i.e.

$$Heterogeneity = \frac{\sqrt{var(k)}}{mean(k)} \tag{11}$$

This heterogeneity measure is invariant with respect to multiplying the connectivity by a scalar.

*Social Network Interpretation of the Heterogeneity:* The heterogeneity can be used to measure the variation of popularity (connectivity) across the individuals.

*Potential Uses of the Heterogeneity:* Describing the reasons for and the meaning of the heterogeneity of complex networks has been the focus of considerable research in recent years [29],[38]. Many complex networks have been found to exhibit an approximate scale-free topology, which implies that these networks are very heterogeneous [3].
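A small sketch of the heterogeneity as the coefficient of variation of the connectivity follows. It assumes the population form of the variance (dividing by *n*); since authors differ on the scaling, treat that choice as an assumption of this example. The function names are hypothetical.

```python
import math

def connectivity(adj):
    """k_i: sum of the adjacencies between node i and all other nodes."""
    return [sum(a for j, a in enumerate(row) if j != i)
            for i, row in enumerate(adj)]

def heterogeneity(adj):
    """Coefficient of variation of the connectivity: sqrt(var(k)) / mean(k)."""
    k = connectivity(adj)
    n = len(k)
    mean_k = sum(k) / n
    if mean_k == 0:
        raise ValueError("network has no connections")
    # Population variance (divide by n); an assumption of this sketch.
    var_k = sum((ki - mean_k) ** 2 for ki in k) / n
    return math.sqrt(var_k) / mean_k
```

Multiplying every adjacency by the same positive constant rescales both the standard deviation and the mean of the connectivity, so the ratio is unchanged, which is the invariance noted above.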

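The *clustering coefficient* of gene *i* is a density measure of local connections, or “cliquishness” [9]. Specifically,

$$ClusterCoef_i = \frac{\sum_{j \ne i} \sum_{l \ne i,j} a_{ij} a_{jl} a_{li}}{\left(\sum_{j \ne i} a_{ij}\right)^2 - \sum_{j \ne i} a_{ij}^2} \tag{12}$$

In unweighted networks, *ClusterCoef _{i}* equals 1 if and only if all neighbors of gene *i* are also linked to each other.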

*Social Network Interpretation of the Clustering Coefficient:* The higher the clustering coefficient of an individual, the higher is the affection among his friends. The clustering coefficient is zero if all of his friends strongly dislike each other.

*Potential Uses of the Clustering Coefficient:* As described in our motivational example, the mean clustering coefficient has been used to measure the extent of module structure present in a network. The relationship between the clustering coefficient and connectivity has been used to describe structural (hierarchical) properties of networks [1].
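The following sketch evaluates a clustering coefficient that works for both weighted and unweighted networks. It assumes the form in which the numerator sums the products *a _{ij}* *a _{jl}* *a _{li}* over pairs of distinct neighbors and the denominator is (Σ *a _{ij}*)² − Σ *a _{ij}*²; returning 0 when the denominator vanishes is a convention chosen for this example, and the function name is hypothetical.

```python
def clustering_coefficient(adj, i):
    """Clustering coefficient of node i for a symmetric adjacency matrix."""
    n = len(adj)
    others = [j for j in range(n) if j != i]
    # Sum over ordered pairs (j, l) of distinct nodes other than i.
    numerator = sum(adj[i][j] * adj[j][l] * adj[l][i]
                    for j in others for l in others if l != j)
    s1 = sum(adj[i][j] for j in others)
    s2 = sum(adj[i][j] ** 2 for j in others)
    denominator = s1 ** 2 - s2
    if denominator == 0:
        # Fewer than two neighbors: no pair to connect (convention: 0).
        return 0.0
    return numerator / denominator
```

In an unweighted triangle every node gets a coefficient of 1, while the hub of a star gets 0 because none of its neighbors are linked to each other.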

To measure the association between connectivity and gene significance, we propose the following measure of *hub gene significance*:

$$HubGeneSignif = \frac{\sum_i GS_i K_i}{\sum_i K_i^2} \tag{13}$$

When *GS _{i}* is proportional to the scaled connectivity (

*Social Network Interpretation of the Hub Gene Significance:* Assume that the node significance measures the grade point average of the *i*th individual. Then the hub node significance can be used to assess whether there is a relationship between popularity (connectivity) and grade point average.

*Potential Uses of the Hub Gene Significance:* Several studies have shown that the relationship between connectivity and gene significance (i.e., the hub gene significance) carries important biological information. For example, in the analysis of yeast networks, highly connected hub genes were found to be essential for yeast survival and there is evidence that hub genes are preserved across species [17], [25], [29]–[32]. A detailed analysis shows that the positive relationship between connectivity and knockout essentiality cannot always be observed [33], i.e., the hub gene significance can be close to 0.
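One way to make the hub gene significance concrete is as the slope of a regression line through the origin when gene significance is regressed on scaled connectivity, i.e. Σ *GS _{i}* *K _{i}* / Σ *K _{i}*². The sketch below assumes that form; the function name is hypothetical.

```python
def hub_gene_significance(gs, scaled_k):
    """Slope of the no-intercept regression of GS on scaled connectivity K."""
    if len(gs) != len(scaled_k):
        raise ValueError("gs and scaled_k must have the same length")
    denominator = sum(k * k for k in scaled_k)
    if denominator == 0:
        raise ValueError("all scaled connectivities are zero")
    return sum(g * k for g, k in zip(gs, scaled_k)) / denominator
```

When the gene significance is exactly proportional to the scaled connectivity, the function returns the proportionality constant; when the two are unrelated it can be close to 0.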

We define the *network significance measure* as the average gene significance of the genes:

$$NetworkSignif = \frac{\sum_i GS_i}{n} \tag{14}$$

*Social Network Interpretation of the Network Significance:* The network significance simply measures the average grade point average among the individuals.

*Potential Uses of the Network Significance:* We refer to the network significance of a module network as “module significance.” The module significance measure can be used to address a major goal of gene network analysis: the identification of biologically significant subnetworks or pathways.

We define the *centroid significance* as the gene significance of a suitably chosen representative node (centroid) in the network:

$$CentroidSignif = GS_{i.centroid} \tag{15}$$

where *i.centroid* denotes the index associated with the centroid. A centroid can be defined in many different ways, e.g., based on connectivity or other centrality measures. In our applications, we define the centroid as the most highly connected gene in the network. If multiple genes attain the maximum connectivity, we define the centroid significance by their average gene significance.

We define the *centroid conformity* of the *i*th gene as the adjacency between the centroid and the *i*th gene

$$CentroidConformity_i = a_{i,\,i.centroid} \tag{16}$$

If multiple genes attain the maximum connectivity, we define the centroid conformity as their average adjacency with the *i*th gene.

*Social Network Interpretation of the Centroid Conformity:* In our affection network, we choose the most popular individual as centroid; then his or her grade point average is the centroid significance. The centroid conformity of the *i*th individual equals his or her affection (connection strength) with the most popular individual.

*Potential Uses of the Centroid Conformity:* Below, we will characterize coexpression networks for which the adjacency *a _{ij}* can be approximated by a product of the centroid conformities: *a _{ij}*≈*CentroidConformity _{i}*·*CentroidConformity _{j}*.

One of the many biological applications of gene coexpression networks is the identification of pathways (modules) and centrally located genes (referred to as module centroids). In our applications, we define highly connected intramodular hub genes as module centroids. Weighted gene coexpression network analysis (WGCNA, [19],[24]) can be considered a step-wise microarray data reduction technique, which starts from the level of thousands of genes, identifies clinically interesting gene modules, and finally represents the modules by their centroids. The module centric analysis alleviates the multiple testing problem inherent in microarray data analysis. Instead of relating thousands of genes to a sample trait, it focuses on the relationship between a few (usually less than 10) modules and the sample trait.

An outline of WGCNA is presented in Figure 3A. The module definition does not make use of *a priori* defined gene sets. Instead, modules are constructed from the expression data by using a tight clustering procedure. Although it is advisable to relate the resulting modules to gene ontology information to assess their biological plausibility, it is not required. Because the modules may correspond to biological pathways, focusing the analysis on modules (and corresponding centroids) amounts to a biologically motivated data reduction method. Intramodular hub genes are centrally located in the module and thus lend themselves as candidates for biomarkers. Examples of biological studies that show the importance of intramodular hub genes are reported in [23]–[25],[33],[39]. Because the expression profiles of intramodular hub genes are highly correlated (in our data, *r*>0.90), typically dozens of candidates result. Although these candidates are statistically equivalent, they may differ in terms of biological plausibility or clinical utility.

Roughly speaking, we define network modules as groups of highly interconnected genes. As detailed in Text S1, Text S2, Text S3, and in our online R tutorials, we use a hierarchical clustering procedure to identify modules (clusters) as branches of the resulting cluster tree. A common but inflexible branch cutting method uses a constant height cutoff value. Alternatively, dynamic branch cutting adaptively chooses cutting values depending on the shape of the branch [40]. Each module is assigned a unique color label (Figure 3B). Our branch cutting algorithm only assigns module colors to branches whose size exceeds a user-specified threshold parameter. In practice, it is advisable to vary the minimum module size and other branch cutting parameters to determine how the results are affected by different parameter choices. An iterative approach for choosing the parameters could be defined by optimizing the module significance. This module detection approach has led to biologically meaningful modules in several applications [1], [8], [23]–[25], [33], [39]–[43] but our theoretical results transcend this particular module detection method. Any module detection method that results in clusters of highly correlated gene expressions could be used.

In the following, we assume that a module detection method (e.g., a clustering procedure) has found *Q* modules. We denote the adjacency matrix of the genes inside the *q*th module by *A*^{(q)}. Thus, *A*^{(q)} represents a subnetwork comprised of the genes in the *q*th module. Analogously, we define *GS*^{(q)} as the gene significance measure restricted to the module genes. Denote by *n*^{(q)} the number of genes inside the *q*th module. Throughout the manuscript, we use the superscript (*q*) to denote quantities associated with the *q*th module. But for notational convenience, we sometimes omit (*q*) when the context is clear.

We define an intramodular network concept *NCF*(*A*^{(q)},*GS*^{(q)}) by evaluating a network concept function *NCF*(·,·) on the adjacency matrix *A*^{(q)} and/or a corresponding gene significance measure *GS*^{(q)}.

For example, the intramodular connectivity is defined by

$$k_i^{(q)} = \sum_{j \ne i} a_{ij}^{(q)} \tag{17}$$

where the *j* indexes the genes in the *q*th module. Intramodular connectivity has been found to be an important complementary gene screening variable for finding biologically important genes [24],[25],[39].

We refer to the network significance (Equation 14) of a module network simply as the **module significance measure**, i.e., the module significance is the average gene significance of the module genes:

$$ModuleSignif^{(q)} = \frac{\sum_i GS_i^{(q)}}{n^{(q)}} \tag{18}$$
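The two intramodular concepts just defined can be sketched as follows, assuming module membership is given as one label per gene (for example a color name). The function names and the label representation are assumptions made for this illustration.

```python
def intramodular_connectivity(adj, labels):
    """For each gene, sum the adjacencies to the other genes of its own module."""
    n = len(adj)
    return [sum(adj[i][j] for j in range(n)
                if j != i and labels[j] == labels[i])
            for i in range(n)]

def module_significance(gs, labels):
    """Average gene significance of the genes in each module."""
    totals, counts = {}, {}
    for g, q in zip(gs, labels):
        totals[q] = totals.get(q, 0.0) + g
        counts[q] = counts.get(q, 0) + 1
    return {q: totals[q] / counts[q] for q in totals}
```

A gene that forms a module on its own has intramodular connectivity 0, since the sum excludes the gene itself.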

The high dimensionality of gene expression data has inspired two broad categories of data reduction techniques. The first category, often used by network theorists, is to reduce the gene coexpression networks into modules. Each module can be represented by a centroid, e.g., an intramodular hub gene. The second category, often used by microarray data analysts, reduces the gene expression data to a small number of components that capture the essential behavior of the expression profiles [27], [44]–[51]. One of our goals is to understand how the two categories of data reduction methods relate to each other. Here we use the singular value decomposition [44],[45],[48] since this will allow us to define a simple measure of factorizability (Equation 24).

For the *q*th module, denote by *X*^{(q)} the *n*^{(q)}×*m* matrix of *n*^{(q)} gene expression profiles across *m* microarrays:

$$X^{(q)} = \left(x_1^{(q)}, \ldots, x_{n^{(q)}}^{(q)}\right)^{T} \tag{19}$$

where *x _{i}* denotes the gene expression vector of the *i*th module gene.

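The singular value decomposition (SVD) of *X*^{(q)} is given by *X*^{(q)}=*U*^{(q)}*D*^{(q)}(*V*^{(q)})^{T}, where the columns of *U*^{(q)} and *V*^{(q)} are the left- and right-singular vectors and *D*^{(q)} is the diagonal matrix of singular values:

$$U^{(q)} = \left(u_1^{(q)} \cdots u_m^{(q)}\right), \quad D^{(q)} = \mathrm{diag}\left(|d_1^{(q)}|, \ldots, |d_m^{(q)}|\right), \quad V^{(q)} = \left(v_1^{(q)} \cdots v_m^{(q)}\right) \tag{20}$$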

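The singular value decomposition of *X*^{(q)} is closely related to the principal component analysis of the correlation matrix *COR*=[cor(*x _{i}*,*x _{j}*)].

We assume that the singular values |*d _{l}*| are arranged in decreasing order. The first right-singular vector is referred to as the **module eigengene**:

$$E^{(q)} = v_1^{(q)} \tag{21}$$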

For brevity, we sometimes drop the superscript (*q*) and simply refer to *E* as the eigengene. The module eigengene can be used to summarize and represent the expression profiles of the module genes, see Figure 4B. The proportion of variance explained by the module eigengene *E*^{(q)} is defined as

$$VarExplained(E^{(q)}) = \frac{|d_1^{(q)}|^2}{\sum_l |d_l^{(q)}|^2} \tag{22}$$
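To illustrate the module eigengene without a linear algebra library, the sketch below finds the first right-singular vector of the module expression matrix by power iteration on *X*^{T}*X* and reports the proportion of variance it explains as the largest squared singular value divided by the sum of all squared singular values. It assumes the rows of the matrix have already been scaled as the analysis requires; the function name, the starting vector, and the fixed iteration count are choices made for this example, and a production analysis would normally use a library SVD routine instead.

```python
import math

def module_eigengene(X, iters=500):
    """Return (E, var_explained) for an n x m matrix X (rows = gene profiles)."""
    n, m = len(X), len(X[0])
    # C = X^T X (m x m); its leading eigenvector is the first
    # right-singular vector of X and its eigenvalues are the d_l^2.
    C = [[sum(X[g][a] * X[g][b] for g in range(n)) for b in range(m)]
         for a in range(m)]
    total = sum(C[a][a] for a in range(m))  # trace = sum of all d_l^2
    if total == 0:
        raise ValueError("expression matrix is all zeros")
    v = [1.0 + 0.01 * a for a in range(m)]  # arbitrary starting vector
    for _ in range(iters):
        w = [sum(C[a][b] * v[b] for b in range(m)) for a in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            raise ValueError("starting vector lies in the null space")
        v = [x / norm for x in w]
    # Rayleigh quotient of the unit vector v gives the largest d_l^2.
    d1_sq = sum(v[a] * sum(C[a][b] * v[b] for b in range(m))
                for a in range(m))
    return v, d1_sq / total
```

The sign of a singular vector is arbitrary, so the returned eigengene may need to be flipped to correlate positively with the module genes. Power iteration also converges slowly when the two largest singular values are nearly equal.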

The module eigengenes of different modules can be highly correlated (Figure 4A). Detecting a high correlation between module eigengenes may either be of biological interest (suggesting interactions between pathways) or it may be a methodological artifact (suggesting poorly defined modules that should be merged). The correlations between two eigengenes can be used to define eigengene coexpression networks [52], e.g., a weighted eigengene coexpression network can be defined as follows

(23)

where *E*^{(q)} and *E*^{(p)} represent the eigengenes of two distinct modules. Apart from correlating the module eigengenes of different modules to each other, one can relate the module eigengenes to an external microarray sample trait *T* to identify trait related modules. Thus, eigengene network analysis can be viewed as a network reduction scheme that reduces a gene coexpression network involving thousands of genes to an orders of magnitude smaller metanetwork involving module representatives (one eigengene per module).

Unlike traditional microarray data reduction methods that impose orthogonality (e.g., principal component analysis) or independence (e.g., independent component analysis), gene coexpression network analysis can be considered a pathway-based data reduction method that allows dependencies between the modules. When focusing on the use of module eigengenes, network analysis can be considered a variant of oblique factor analysis.

While a high level view of modular gene coexpression networks can be viewed as a data reduction technique, many network analyses focus on the pairwise relationships of relatively few (hundreds) of correlated genes, i.e., genes that form a single module in a larger network. For example, the 498 genes of our motivational example were part of a body weight related module, which was found in a large gene coexpression network based on the female mouse liver samples [23].

The low-level analysis of a single network module may help identify key genes that may be used as therapeutic targets or candidate biomarkers. An important question of low level analysis is to efficiently describe the connection strengths between interacting module genes. We have provided *empirical* evidence that many module adjacency matrices, i.e., networks comprised of genes of a single module, are approximately factorizable [8]. In such networks, the adjacency between module genes *i* and *j* can approximately be factored into gene specific contributions, i.e., *a _{ij}*

An open theoretical research question is to characterize microarray data that lead to factorizable coexpression networks. Here we solve this problem for the case of modules in a gene coexpression network. Toward this end, we propose the following measure of **eigengene factorizability**:

$$EF(X^{(q)}) = \frac{|d_1^{(q)}|^4}{\sum_l |d_l^{(q)}|^4} \tag{24}$$

Note that 0≤*EF*(*X*^{(q)})≤1 and the close resemblance to the proportion of variance explained by the module eigengene (Equation 22). In the Methods section, we argue that *EF*(*X*^{(q)})≈1 implies that the correlation matrix factors as follows

$$\mathrm{cor}(x_i, x_j) \approx \mathrm{cor}(x_i, E)\,\mathrm{cor}(x_j, E)$$

Further, we derive the following

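*If the eigengene factorizability EF(X ^{(q)}) is close to 1, the adjacencies of the weighted coexpression module network A^{(q)}=|cor(X^{(q)})|^{β} and the trait-based gene significance measure GS_{i}^{(q)}=|cor(x_{i}^{(q)},T)|^{β} can be factored as follows*

$$a_{ij}^{(q)} \approx a_{e,i}^{(q)}\, a_{e,j}^{(q)} \ \ (i \ne j), \qquad GS_i^{(q)} \approx a_{e,i}^{(q)}\, a_{e,t}^{(q)} \tag{25}$$

*where*

$$a_{e,i}^{(q)} = |\mathrm{cor}(x_i^{(q)}, E^{(q)})|^{\beta} \tag{26}$$

*is referred to as the* **eigengene conformity** *of the ith gene, and*

$$a_{e,t}^{(q)} = |\mathrm{cor}(E^{(q)}, T)|^{\beta} \tag{27}$$

*is referred to as the qth module* **eigengene significance** *with respect to T, also denoted as EigengeneSignif ^{(q)}.*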

As described in Table 1, the eigengene significance and the eigengene conformity are the eigengene-based counterparts of the centroid significance (Equation 15) and centroid conformity (Equation 16), respectively.

The eigengene-based approximations on the right hand side of Equation 25 motivate us to define the eigengene-based adjacency matrix *A _{E}*

(28)

(29)

For our coexpression modules, we find empirically that the eigengene factorizability is close to 1 (see Table 2, Text S1, Text S2, and Text S3).
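To make the contrast between the proportion of variance explained and the eigengene factorizability concrete, the sketch below computes both from a correlation matrix. It rests on assumptions that should be checked against Equations 22 and 24: the proportion of variance explained is taken to be |*d*_{1}|^{2}/Σ|*d_{l}*|^{2} and the factorizability is taken to be |*d*_{1}|^{4}/Σ|*d_{l}*|^{4}, which in terms of the correlation matrix are λ_{1}/*n* and λ_{1}^{2} divided by the squared Frobenius norm. The helper names and the equicorrelated toy matrix are hypothetical.

```python
import math

def leading_eigenvalue(matrix, iterations=500):
    """Largest eigenvalue of a symmetric positive semidefinite matrix.

    Uses power iteration from a uniform start vector, which is adequate when
    most correlations are positive (as in a coexpression module).
    """
    n = len(matrix)
    u = [1.0 / math.sqrt(n)] * n
    lam = 0.0
    for _ in range(iterations):
        w = [sum(matrix[i][j] * u[j] for j in range(n)) for i in range(n)]
        lam = math.sqrt(sum(x * x for x in w))  # ||C u|| tends to lambda_1
        u = [x / lam for x in w]
    return lam

def variance_explained(corr):
    """Assumed form: lambda_1 / trace(corr), and trace(corr) = n."""
    return leading_eigenvalue(corr) / len(corr)

def eigengene_factorizability(corr):
    """Assumed form: lambda_1^2 / sum of squared eigenvalues (= squared Frobenius norm)."""
    frobenius_sq = sum(c * c for row in corr for c in row)
    return leading_eigenvalue(corr) ** 2 / frobenius_sq

def equicorrelation(n, rho):
    """Toy correlation matrix with every off-diagonal entry equal to rho."""
    return [[1.0 if i == j else rho for j in range(n)] for i in range(n)]

corr = equicorrelation(5, 0.8)
print(round(variance_explained(corr), 3), round(eigengene_factorizability(corr), 3))
# prints: 0.84 0.991
```

In this toy case the first eigenvalue is 4.2 and the remaining four equal 0.2, so the factorizability is much closer to 1 than the proportion of variance explained, the same qualitative pattern that Table 2 reports for the proper modules.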

Abstractly speaking, Observation 1 allows us to characterize coexpression networks for which the adjacency *a _{ij}* can be approximated by a product of the centroid conformities (Equation 16):

In the Methods section, we argue that *EF*(*X*^{(q)})≈1 if the module gene expressions *x _{i}*

Here we define **eigengene-based network concepts** as a step towards a geometric interpretation of network concepts. Analogous to the case of intramodular network concepts, we define eigengene-based network concepts by evaluating the network concept function *NCF*(*A _{E}*

(30)

where . Under the assumptions of Observation 1, we find that *A*
^{(q)}≈*A _{E}*

*If A ^{(q)}=|cor(X^{(q)})|^{β} and the eigengene factorizability EF(X^{(q)}) is close to 1, the network concepts can be approximated by their eigengene-based analogs.*

This observation is illustrated in Figure 6.

It can be advantageous to replace network concepts by their eigengene-based analogs when studying theoretical properties. To illustrate this point, we briefly describe the effect of soft thresholding *a _{ij}*=

A major theoretical advantage of eigengene-based network concepts is that they reveal simple relationships amongst each other. For example, it is straightforward to derive

(31)
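The relationship between the clustering coefficient and (1+*Heterogeneity*^{2})^{2}×*Density*, which the text attributes to Equation 31, can be illustrated on an exactly factorizable network. The sketch below assumes the standard weighted-network definitions of connectivity, density, heterogeneity (coefficient of variation of the connectivity), and clustering coefficient; the conformities are random numbers chosen for illustration only.

```python
import math
import random

def network_concepts(adj):
    """Return (density, heterogeneity, mean clustering coefficient) of a weighted network."""
    n = len(adj)
    # Connectivity: row sums excluding the diagonal.
    k = [sum(adj[i][j] for j in range(n) if j != i) for i in range(n)]
    density = sum(k) / (n * (n - 1))
    mean_k = sum(k) / n
    var_k = sum((x - mean_k) ** 2 for x in k) / n
    heterogeneity = math.sqrt(var_k) / mean_k
    cluster = []
    for i in range(n):
        # Weighted triangles through node i over the maximum possible value.
        num = sum(adj[i][j] * adj[j][l] * adj[l][i]
                  for j in range(n) if j != i
                  for l in range(n) if l != i and l != j)
        den = k[i] ** 2 - sum(adj[i][j] ** 2 for j in range(n) if j != i)
        cluster.append(num / den)
    return density, heterogeneity, sum(cluster) / n

# Exactly factorizable network a_ij = c_i * c_j with hypothetical conformities c_i.
rng = random.Random(1)
conformity = [rng.uniform(0.4, 0.9) for _ in range(40)]
adj = [[1.0 if i == j else conformity[i] * conformity[j] for j in range(40)]
       for i in range(40)]

density, heterogeneity, mean_cc = network_concepts(adj)
approx = (1 + heterogeneity ** 2) ** 2 * density
print(mean_cc, approx)
```

The two printed numbers agree up to terms of order 1/*n*, so the agreement improves as the module grows.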

To arrive at particularly simple relationships among network concepts, we make use of the following terminology. We denote the **maximum eigengene conformity** as *a _{e}*

(32)

We refer to Equation 32 as the **maximum conformity assumption**. With the results in the Methods section, one can show that the maximum conformity assumption implies the following

*If A ^{(q)}=|cor(X^{(q)})|^{β}, EF(X^{(q)})≈1 and the maximum conformity assumption applies, intramodular network concepts satisfy the following relationships*

(33)

(34)

(35)

(36)

(37)

(38)

(39)

(40)

*where mean(ClusterCoef ^{(q)}) denotes the mean clustering coefficient, ClusterCoef_{max}^{(q)}=max_{j}(ClusterCoef_{j}^{(q)}) and MAR_{max}^{(q)}=max_{j}(MAR_{j}^{(q)}).*

In practice, we find that the maximum conformity assumption holds well for low values of *β*. Below, we study the robustness of our results with respect to higher powers and alternative network construction methods.

Observations 2 and 3 allow us to provide a geometric interpretation of intramodular network concepts.

The relationship between the scaled intramodular connectivity *K _{i}*

We provide two geometric interpretations of the density. The first makes use of the relationship *a _{ij}*

The eigengene-based heterogeneity equals the coefficient of variation of the *a _{E}*

The *i*th gene has high eigengene-based significance *GS _{E,i}*

We provide two geometric interpretations of the module significance (Equation 14). The first interpretation is based on the definition of the module significance as average gene significance; a module has high module significance if on average the angles between the module expression profiles and the sample trait tend to be small. The second interpretation of the module significance is based on Equation 37: a module has high significance if the module density is high and the angle between the module eigengene and the sample trait is small.

Here we illustrate how the geometric interpretation of gene coexpression networks can be used to derive results, which may be interesting to microarray data analysts.

Multiple approaches are conceivable for summarizing the expression profiles of the genes inside a single module. One approach (popular with statisticians) applies a singular value decomposition to the expression data and summarizes the module with the module eigengene. Another approach (popular with network theorists) is to construct a module network and to use the most highly connected hub gene as centroid. Since Equation 33 implies that hub genes are highly correlated with the module eigengene, we find that the two seemingly different approaches will lead to very similar results in practice (Figure 4C).

Since module construction is computationally intensive, one often restricts the module detection analysis to a subset of the original genes on the microarray, e.g., the most varying and/or the most connected genes. To counter this loss of information, generalizing the intramodular connectivity to extramodular genes, i.e. genes outside the module, is an important problem. Our solution is motivated by the relationship between the intramodular connectivity and its eigengene based analog (Equation 33). Specifically, the *q*th module eigengene gives rise to an eigengene-based scaled intramodular connectivity measure

(41)

Under the assumptions of Observation 3, Equation 33 implies that *K _{i}*

Module detection usually involves certain parameter choices. For some genes, it may be difficult to decide whether they belong to a particular module or whether they belong to more than one module. Instead of reporting a binary indicator of module membership, it can be advantageous to report a fuzzy measure of module membership, which takes on values in the unit interval [0,1]. A natural choice for a fuzzy measure of module membership is the eigengene-based scaled intramodular connectivity measure *K*_{cor,i}^{(q)} (Equation 41). The fuzzy module membership measures *K*_{cor,i}^{(q)} specify how close gene *i* is to modules *q*=1,…,*Q*. It is straightforward to use these measures for finding genes that are close to two modules, i.e., intermediate genes. In Figure 7, we show the pairwise relationships among different *K*_{cor,i}^{(q)} measures where the genes are colored by their original module assignment. Note that many of the nonmodule (grey) genes lie intermediate between the proper module genes.
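A fuzzy module membership of this kind is simple to compute once module eigengenes are available. The sketch below assumes that the measure takes the form |cor(*x_{i}*,*E*^{(q)})|^{β}; the two eigengenes and the intermediate gene are small hypothetical vectors chosen so that the eigengenes are uncorrelated.

```python
import math

def standardize(v):
    """Center a profile and scale it to unit Euclidean length."""
    mean = sum(v) / len(v)
    centered = [x - mean for x in v]
    norm = math.sqrt(sum(c * c for c in centered))
    return [c / norm for c in centered]

def fuzzy_membership(profile, eigengenes, beta=1):
    """Assumed form of the membership: |cor(profile, eigengene)|^beta per module."""
    z = standardize(profile)
    return [abs(sum(a * b for a, b in zip(z, standardize(e)))) ** beta
            for e in eigengenes]

# Two hypothetical, mutually uncorrelated module eigengenes over six samples.
e1 = [1, 2, 3, 4, 5, 6]
e2 = [1, -2, 1, 1, -2, 1]

# A gene lying exactly between the two eigengenes.
intermediate = [a + b for a, b in zip(standardize(e1), standardize(e2))]

print(fuzzy_membership(e1, [e1, e2]))            # close to [1, 0]
print(fuzzy_membership(intermediate, [e1, e2]))  # both close to 0.707
```

Raising the power *β* shrinks intermediate memberships quickly: with *β*=6 the intermediate gene above would score about 0.125 in both modules.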

In the following, we provide several examples that illustrate potential uses of the geometric interpretation.

While fundamental network concepts are defined as functions of the network adjacency matrix, their eigengene-based analogs are often simple monotonic functions of correlation coefficients. This insight can be used to attach significance levels (*p*-values) to several eigengene-based network concepts. For example, the eigengene-based hub gene significance is a monotonic function of the correlation between the eigengene and the sample trait (Equation 34). Thus, one can use a correlation test *p*-value [53] or a regression-based *p*-value for assessing the statistical significance of the correlation between *E*^{(q)} and the sample trait *T*. Analogously, one can attach a significance level to the fuzzy module membership measures *K*_{cor,i}^{(q)} (Equation 41).
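As a rough illustration of attaching a *p*-value to a correlation, the sketch below uses Fisher's z-transform with a normal approximation. This is a swapped-in approximation that needs only the Python standard library; it is not necessarily the correlation test of reference [53], and the function name is hypothetical.

```python
import math

def correlation_p_value(r, m):
    """Approximate two-sided p-value for H0: correlation = 0.

    r is a sample Pearson correlation computed from m samples. Fisher's
    z-transform atanh(r) is treated as normal with standard error
    1/sqrt(m - 3), which is an approximation.
    """
    if m <= 3:
        raise ValueError("need more than 3 samples")
    if abs(r) >= 1.0:
        return 0.0
    z = math.atanh(r) * math.sqrt(m - 3)
    # Two-sided tail probability of a standard normal: erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))

# Example: a correlation of 0.5 observed in 55 samples.
print(correlation_p_value(0.5, 55))
```

Because the eigengene-based concepts are monotonic in the absolute correlation, the same *p*-value applies for any power *β*.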

Since the gene coexpression network concepts are based on correlations between quantitative variables, one can use permutation test procedures to attach significance levels to network concepts. By randomly permuting the gene expression values of each gene, it is possible to destroy the correlation structure inherent in the original data. We find that the resulting permuted data lead to networks with low density and low mean clustering coefficients (reflecting the lack of large modules).
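A permutation procedure of this kind can be sketched as follows for the density. The data are synthetic and hypothetical (a shared signal plus noise), the density is taken to be the mean off-diagonal adjacency |cor|^{β}, and the add-one *p*-value estimate is a common convention rather than something prescribed in the text.

```python
import math
import random

def standardize(v):
    """Center a profile and scale it to unit Euclidean length."""
    mean = sum(v) / len(v)
    centered = [x - mean for x in v]
    norm = math.sqrt(sum(c * c for c in centered))
    return [c / norm for c in centered]

def density(profiles, beta=1):
    """Mean off-diagonal adjacency |cor(x_i, x_j)|^beta."""
    z = [standardize(p) for p in profiles]
    n = len(z)
    total = sum(abs(sum(a * b for a, b in zip(z[i], z[j]))) ** beta
                for i in range(n) for j in range(i + 1, n))
    return total / (n * (n - 1) / 2)

def permuted_densities(profiles, n_perm, rng, beta=1):
    """Densities after independently shuffling the values of every gene."""
    results = []
    for _ in range(n_perm):
        shuffled = []
        for p in profiles:
            q = list(p)
            rng.shuffle(q)  # breaks the sample alignment between genes
            shuffled.append(q)
        results.append(density(shuffled, beta))
    return results

# Synthetic module (hypothetical data): 15 genes, 40 samples.
rng = random.Random(0)
signal = [rng.gauss(0, 1) for _ in range(40)]
module = [[load * s + 0.5 * rng.gauss(0, 1) for s in signal]
          for load in [0.5 + i / 14 for i in range(15)]]

observed = density(module)
null = permuted_densities(module, 20, rng)
# Add-one estimate of the permutation p-value.
p_value = (1 + sum(d >= observed for d in null)) / (1 + len(null))
print(observed, max(null), p_value)
```

With only 20 permutations the smallest attainable *p*-value is 1/21, so a real analysis would use many more.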

The relationship between centralization and density (Equation 40) is surprisingly simple for coexpression networks but it does not hold in general networks. For a general network, one can only derive an upper bound for the centralization in terms of the density [35]. As a caveat, we mention that our empirical studies (described below) show that Equation 40 is not very robust with regard to deviations from our theoretical assumptions.

The geometric interpretation of gene coexpression network analysis can be used to argue that a gene that lies “intermediate” between two distinct modules cannot be a highly connected intramodular hub gene in either module (see Figure 5B). More precisely, we refer to gene *i* as hub gene in module 1 if its scaled connectivity *K _{i}*

Equation 33 allows us to translate statements about the scaled intramodular connectivity into statements about the angles between genes and module eigengenes. A gene is an intermediate gene if it has a moderately small angle with both module eigengenes. If the eigengenes are distinct (i.e., the angle between them is large), the intermediate gene cannot have a very small angle with either module eigengene, i.e., it cannot be an intramodular hub gene in either module. A geometric interpretation of this example can be found in Figure 5B.
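The angular argument can be written as a short bound. The following is a sketch that assumes absolute correlations are read as cosines of angles between standardized profiles, so that the triangle inequality for angles applies; the symbols θ_1, θ_2, and φ are introduced here for illustration only.

```latex
\[
\theta_q = \arccos\bigl|\operatorname{cor}(x_i, E^{(q)})\bigr| \quad (q = 1, 2),
\qquad
\varphi = \arccos\bigl|\operatorname{cor}(E^{(1)}, E^{(2)})\bigr| .
\]
\[
\theta_1 + \theta_2 \ge \varphi
\quad\Longrightarrow\quad
\min\Bigl(\bigl|\operatorname{cor}(x_i, E^{(1)})\bigr|,\;
          \bigl|\operatorname{cor}(x_i, E^{(2)})\bigr|\Bigr)
\le \cos\!\left(\frac{\varphi}{2}\right).
\]
```

For example, if the two eigengenes are uncorrelated (φ=π/2), a gene that is equally close to both cannot have an absolute correlation above cos(π/4)≈0.71 with either eigengene, and raising this value to a power *β*>1 lowers it further.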

As an important caveat, we mention that intermediate network genes may well be highly connected “hub” genes if the factorizability property does not hold such as in the entire network comprised of multiple distinct modules.

For a trait-based gene significance measure, the striking relationship between module significance and hub gene significance (Equation 37) suggests a positive relationship between connectivity and gene significance (high hub gene significance) in modules that are enriched with significant genes (high module significance).

Further, Equation 34 shows that the hub gene significance of a module network is determined by the angle between the module eigengene and the sample trait. This allows us to describe situations when a module has high hub gene significance, i.e., when there is a strong positive relationship between gene significance and intramodular connectivity. In the example provided in Figure 5C and 5D, the angle between *E* and *T2* is small which implies that the hub gene significance with regard to *GS2 _{i}*=|cor(

To facilitate the communication between microarray data analysts and network theorists, we provide a short dictionary for translating between microarray data analysis and network theory terminology. More specifically, for a subset (module) of genes that have high expression factorizability, Table 1 describes the correspondence between general network terms and their eigengene-based counterparts. While our theoretical derivations assume a weighted gene coexpression network, our robustness studies show empirically that many of the findings apply to unweighted networks as well. The summary of empirical robustness studies is described below.

In general, eigengene-based concepts are no substitute for network concepts. It is natural to use network concepts when describing the pairwise relationships between genes and to use eigengene-based network concepts when relating the gene expression profiles to a module eigengene. Since eigengene-based network concepts tend to be relatively simple, they often simplify theoretical derivations. Further, many of them allow one to calculate a statistical significance level (*p*-value) using a correlation or regression based test statistic.

To illustrate the theoretical results we report 4 different microarray data applications. The underlying data sets and R software code can be found on our webpage http://www.genetics.ucla.edu/labs/horvath/ModuleConformity/GeometricInterpretation/.

Here we describe a weighted gene coexpression network that was constructed on the basis of 55 microarray samples of glioblastoma (brain cancer) patients. A detailed description of the data, modules, and biological implications can be found in [24]. We defined 6 modules as branches of an average linkage hierarchical cluster tree (Figure 3B). Module membership in the 6 “proper” modules is color-coded by turquoise, blue, brown, yellow, green and red. Grey denotes the color of genes that were not grouped into any of the 6 proper modules. To allow for a comparison, we also report results for the “improper” module comprised of grey genes.

We used the patient survival time as microarray sample trait *T*. We defined a gene significance measure as the absolute value of the correlation between *T* and the gene expression profiles (Equation 4). The module significance was defined as average gene significance (Equation 14). Figure 3C shows that the brown module had the highest module significance. This module was previously found to be enriched with genes that are prognostic of patient survival [24].

By relating the gene significance measure *GS _{i}* to the scaled connectivity

We defined the module eigengene significance (Equation 27) as the absolute value of the correlation between the module eigengene and patient survival time. The brown module eigengene also had the highest eigengene significance: *a _{e,t}*

We visualize the gene expression profiles of module genes with a heat map plot (Figure 4B) where rows correspond to the genes, the columns to the samples, and the gene expression profiles have been standardized to a mean of 0 and a variance of 1. The heat map colors high and low expression values by red and green, respectively. For a given module, the heat map exhibits characteristic vertical bands that reflect the high correlation among module gene expression profiles. For the 6 proper modules of our brain cancer application, the proportion of variance explained by the first eigengene ranges from 0.59 to 0.71 (Table 2). For the improper grey module genes (defined as genes outside of all proper modules) the proportion of variance explained by the first eigengene is only 0.28. Similarly, when all network genes are used to define an improper module, the proportion of variance explained by the first eigengene is only 0.32. As expected by module construction, we find that the gene expression data of proper modules have high eigengene factorizabilities *EF*(*X*)≥0.97 (Table 2). By contrast, the factorizability of the grey genes (i.e., the genes outside of proper modules) is relatively low (*EF*(*X*)=0.66).

For each module, Table 2 reports network properties including network size, density, centralization, heterogeneity, mean clustering coefficient, module significance, hub gene significance, and eigengene significance. For the proper (nongrey) modules, we find that the numerical values of the intramodular network concepts and their eigengene-based analogs support our theoretical derivations.

Our empirical results illustrate Observation 2 regarding the relationship between intramodular network concepts and their eigengene-based analogs. Figure 6A–E depict the relationships among centralization, heterogeneity, clustering coefficient, module significance, hub gene significance and their respective eigengene-based analogs when a soft threshold of *β*=1 is used for the weighted network construction (Equation 2). The analogous results for *β*=6 are depicted in Figure 6G–K. Figure 6F and 6L depict the relationship between hub gene significance (Equation 13) and module eigengene significance (Equation 27) for *β*=1 and *β*=6, respectively. For completeness, we also report the results for the grey, nonmodule genes in the figures. However, since our theoretical results assume proper modules, we exclude the grey genes from the calculation of the squared correlation coefficient *R*^{2}. The summary of a robustness analysis with regard to different soft thresholds *β* and hard thresholds *τ* is reported in Table 3 and Text S1. Overall, we find very high squared correlations (*R*^{2}>0.85), which confirm our theoretical results. Only the *R*^{2} value for the relationship between the clustering coefficient and its eigengene-based analog decreases if *β*>3.

Figure 8 illustrates the implications of Observation 3 regarding the relationships among network concepts in the cancer coexpression module networks. Figure 8A shows that the scaled connectivity *K _{i}*

Figure 8B illustrates the relationship between the clustering coefficient (the mean corresponds to the short horizontal line) and (1+*Heterogeneity*^{2})^{2}×*Density* (Equation 31). This relationship is diminished for soft thresholds *β*>3 as can be seen from Table 3.

Figure 8C illustrates the relation (Equation 37), which is highly robust with regard to different choices of *β* (Table 3). Figure 8D illustrates (Equation 40). This relationship is *not* robust with regard to *β*: the *R ^{2}* value is only 0.058 for

Although our theoretical results were derived using relatively restrictive assumptions, we find that most results are robust in the weighted networks (see Figure 9, Table 3, and Text S1). However, in unweighted networks, several relationships have lower *R*^{2} values and show a strong dependence on the hard threshold *τ* (Table 3).
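The two network construction methods compared in the robustness analysis can be sketched as follows. The soft threshold follows the form |cor|^{β} used throughout the text; for the unweighted network, the choice of "greater than or equal to *τ*" for declaring a connection is an assumption of this sketch, as are the function names and the toy correlation matrix.

```python
def soft_threshold(corr, beta):
    """Weighted adjacency a_ij = |cor_ij|^beta, with diagonal set to 1."""
    n = len(corr)
    return [[1.0 if i == j else abs(corr[i][j]) ** beta for j in range(n)]
            for i in range(n)]

def hard_threshold(corr, tau):
    """Unweighted adjacency: 1 if |cor_ij| >= tau (assumed convention), else 0."""
    n = len(corr)
    return [[1.0 if i == j or abs(corr[i][j]) >= tau else 0.0 for j in range(n)]
            for i in range(n)]

# Toy correlation matrix for three hypothetical genes.
corr = [[1.0, 0.9, 0.2],
        [0.9, 1.0, -0.5],
        [0.2, -0.5, 1.0]]

weighted = soft_threshold(corr, beta=2)
unweighted = hard_threshold(corr, tau=0.5)
print(weighted)
print(unweighted)
```

The weighted adjacency changes smoothly with *β*, whereas a correlation near *τ* flips between 0 and 1 in the unweighted network, which is one intuitive reason for the stronger threshold dependence of unweighted results.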

The mouse tissues came from an F2 intercross between two mouse strains C3H/HeJ and C57BL/6J. The data were already described above and in Figure 1. The 498 genes were part of a body weight related module in liver tissue (the Blue module described in reference [23]). Table 4 presents network concepts and their eigengene-based analogs in the different tissue networks. As predicted by Observation 2, we find a close relationship between the two types of network concepts if the eigengene factorizability of the corresponding network is close to 1. This example also illustrates that our results apply to coexpression networks comprised of relatively few genes (here 498 genes).

Here we focus on the female mouse liver tissues of the above-mentioned F2 mouse cross. Specifically, 135 female mice were used to construct a weighted network comprised of 3,400 highly connected genes. The biological significance and gene ontology enrichment analysis of the 12 modules in this large network is described in [23]. In Text S2, Table 5, and Figure 9, we focus on the relationships among the network concepts. We find that many of our theoretical results hold approximately even if the expression factorizability is low. Table 5 shows how the relationship (*R*^{2} values) between network concepts and their eigengene-based analogs depends on the soft threshold *β*. Overall, we find that our theoretical results are highly robust in weighted networks. The relationship between the clustering coefficient and its eigengene-based analog is diminished (down to 0.44) for *β*>3. The relationship between heterogeneity and its eigengene-based analog is diminished (down to 0.54) when *β*<3.

The relation (Equation 40) has a relatively low *R*^{2} value (down to 0.21) for low values of *β*≤3 but the other relationships among network concepts are highly robust with respect to *β*. For unweighted networks, the *R*^{2} values tend to be lower and several relationships show a marked dependency on the hard threshold *τ* (Table 5).

In Text S3, we illustrate our theoretical derivations using a yeast gene coexpression network. The yeast microarray data were derived from experiments designed to study the cell cycle [54]. A detailed biological description of the modules and the importance of intramodular connectivity can be found in previous work [33]. In Text S3 and in Figure 9, we use a gene significance measure that encodes knock-out essentiality, i.e., *GS _{i}*=1 if the

Table 6 shows how the relationship (squared correlation *R*^{2}) between network concepts and their eigengene-based analogs depends on the soft threshold *β*. Overall, we find that our theoretical results are highly robust in weighted networks. The relation (Equation 40) breaks down for *β*=3 or 4 but the other relationships among network concepts are highly robust with respect to *β*. For unweighted networks, the *R*^{2} values tend to be lower and several relationships show a marked dependency on the hard threshold *τ* (Table 6).

Network theoretic methods and concepts are increasingly used for the systems biologic analysis of microarray data. We illustrate how network concepts can be used for describing large correlation matrices and for arriving at biologically plausible data reduction techniques. Many alternative approaches for defining gene coexpression networks are possible, e.g., [13], [55]–[61]. Here we define the network adjacency and the gene significance measure in terms of correlations since this allows us to interpret pairwise relations in terms of angles between scaled versions of the variables. For example, the sample trait based gene significance measure of the *i*th gene is determined by the angle between the *i*th gene expression profile and the sample trait *T* (Equation 4); the scaled intramodular connectivity of the *i*th gene (Equation 33) is determined by the angle between the *i*th gene expression profile and the module eigengene; the hub gene significance (Equation 34) is determined by the angle between module eigengene and the sample trait.

The geometric interpretation of gene coexpression network analysis reveals a deep connection to other statistical methods. Since it projects the gene expression profiles onto the hypersphere in an *m*-dimensional Euclidean space, network analysis can be considered a special case of directional statistics. When focusing on the use of module eigengenes, network analysis can be considered a variant of oblique factor analysis.

A high level view of modules and their centroids (eigengenes) can be used to define eigengene networks [52]. High correlations (small angles) between module eigengenes may suggest close relationships between the corresponding pathways. A low level view of a single module allows us to provide a geometric interpretation of intramodular network concepts. We use the singular value decomposition of module expression data to characterize approximately factorizable gene coexpression networks, i.e., adjacency matrices that satisfy *a _{ij}*

The derivation of Observation 1 in the Methods section highlights a theoretical advantage of the soft-thresholding approach (Equation 2); the resulting weighted network maintains the approximate factorizability of the underlying correlation matrix: *a _{ij}*

Using multiple different gene coexpression networks from mouse tissues, brain cancer, and yeast, we provide empirical evidence that coexpression modules tend to have high eigengene factorizability and that the maximum conformity assumption (Equation 32) is satisfied for low powers of *β*.

We propose eigengene-based analogs of network concepts (Equation 30). While network concepts are functions of the adjacency matrix, eigengene-based network concepts are analogous functions of the eigengene conformities |cor(*x _{i}*

We use the correspondence between intramodular network concepts and their eigengene-based analogs to provide a geometric interpretation of network concepts. Observation 2 states that network concepts in weighted gene coexpression module networks are approximately equal to their eigengene-based analogs. A major theoretical advantage of eigengene-based network concepts is that they reveal simple relationships. To arrive at particularly simple relationships, we make the maximum conformity assumption (Equation 32) for the results presented in the main text. Table 1 provides a rough dictionary for translating between gene coexpression network analysis and the singular value decomposition if the underlying expression data have high eigengene factorizability (say *EF*(*X*^{(q)})>0.95) and if the maximum conformity assumption (Equation 32) is satisfied. However, even if the maximum conformity assumption does not hold, one can still find simple relationships among the network concepts (Equation 49).

The geometric interpretation of gene coexpression networks facilitates the derivation of several results that should be interesting to network theorists. For example, we argue that highly connected intramodular hub genes cannot be intermediate between two distinct coexpression modules (Figure 5B). The geometric interpretation is particularly useful when studying gene significance and module significance measures that are based on a microarray sample trait (Equation 4). To study the relationship between connectivity and gene significance, we propose a novel measure of hub gene significance (Equation 13). We find that the hub gene significance of a module network is determined by the angle between the module eigengene and the microarray sample trait (Equation 34). Our geometric interpretation of coexpression networks allows us to describe situations when a module has low hub gene significance (Figure 5C and 5D). Our theoretical derivations for relating module significance to hub gene significance (Equation 37) assume a gene significance measure based on a sample trait. Although this important assumption is violated for the gene significance measure (knock-out essentiality) in the yeast network, it is striking that the relationship between hub gene significance and module significance can still be observed in this application (Figure 9).

We provide a robustness analysis that shows that many of our theoretical results apply even if our underlying assumptions are not satisfied (Figures 6 and 9, Tables 3, 5, and 6, Text S1, Text S2, and Text S3). We find that the correspondence between network concepts and their eigengene-based analogs is often better in weighted networks than in unweighted networks. Further, we find that the results in weighted networks tend to be more robust than those in unweighted networks with regard to changing the network construction thresholds *β* and *τ*, respectively. Thus, weighted coexpression networks are preferable over unweighted networks when a geometric interpretation of network concepts is desirable.

The correspondence between coexpression module networks and the singular value decomposition (Table 1) can break down when a high soft threshold is used for constructing a weighted network or when dealing with an unweighted network. Thus, eigengene-based concepts do not replace network concepts when describing interaction patterns among genes.

While this article has a theoretical bent, we illustrate the results on three different microarray data sets (human, mouse, and yeast) that are described in our online R software tutorials, in Text S1, Text S2, and Text S3. Our theoretical results also apply to networks comprised of genes that are highly correlated with a sample trait. The key assumption underlying our results is high eigengene factorizability *EF*(*X*^{(q)}). To illustrate this point, Text S4 describes a brain cancer network comprised of the 500 genes with highest absolute correlation with brain cancer survival time. Our results illustrate that the geometric interpretation of gene coexpression networks has important theoretical and practical implications that may guide the development and application of network methods.

Analogous to [8], we define a *network concept function* to be a function of a square matrix *M*=[*M _{ij}*] (1≤

We make use of the following network concept functions:

(42)

where the components of matrix *B _{M}* in the denominator of the clustering coefficient function are given by

According to our convention, the diagonal elements of the adjacency matrix are set to 1. Therefore, the diagonal elements of *A–I* (where *I* denotes the identity matrix) equal 0. Now we are ready to define the (fundamental) network concepts that are studied in this article.

**Definition of Fundamental Network Concepts:**
*The fundamental network concepts of a network A are defined by evaluating the network functions (Equation 42) on A–I and the gene significance measure GS, i.e.,*

For example, the connectivity is given by

(43)

We define an **intramodular network concept** *NCF*(*A*^{(q)}−*I*,*GS*^{(q)}) by evaluating the network concept function on the restricted adjacency matrix *A*^{(q)} and the restricted gene significance measure *GS*^{(q)}.

We will now define eigengene-based network concepts. Using the eigengene-based adjacency matrix *A _{E}*

As an example, consider the eigengene-based connectivity given by

(44)

Here we derive Observation 1, which characterizes approximately factorizable gene coexpression module networks. To simplify the presentation, we omit the superscripts (*q*) in the following, e.g., we will write *EF*(*X*) instead of *EF*(*X*^{(q)}). We will argue that if the eigengene factorizability *EF*(*X*) is close to 1, the adjacencies of the weighted coexpression module network *A*=|cor(*X*)|^{*β*} and the trait-based gene significance measure

(45)

where

(46)

(47)

Since our gene coexpression networks are defined with respect to the correlation matrix [cor(*x _{i}*,

Note that *u*_{1,i}|*d*_{1}|^{2}*u*_{1,j}/*m*=cor(*x _{i}*,

This equation motivates us to propose the following measure of **eigengene factorizability**:

(48)

Note that 0≤*EF*(*E*)≤1. By definition *EF*(*E*)≈1 implies that

By raising both sides of this equation to a power *β*, we find

The last step highlights an important theoretical advantage of the soft thresholding method: it preserves the approximate factorizability of the underlying correlation matrix.
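The step from the correlation matrix to the weighted adjacency can be spelled out as follows. This is a sketch for *i*≠*j* that assumes the eigengene conformity has the form |cor(*x_{i}*,*E*)|^{β}, consistent with the factorization stated in Observation 1.

```latex
\[
\operatorname{cor}(x_i, x_j) \approx \operatorname{cor}(x_i, E)\,\operatorname{cor}(x_j, E)
\quad\Longrightarrow\quad
a_{ij} = \bigl|\operatorname{cor}(x_i, x_j)\bigr|^{\beta}
\approx \bigl|\operatorname{cor}(x_i, E)\bigr|^{\beta}\,
        \bigl|\operatorname{cor}(x_j, E)\bigr|^{\beta}
= a_{e,i}\, a_{e,j}.
\]
```

The power function passes through the product because |*uv*|^{β}=|*u*|^{β}|*v*|^{β}. A hard threshold has no such property: the indicator that a product exceeds *τ* is in general not a product of node-specific terms.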

An alternative, possibly more direct way of motivating the observation is based on the insight that the squared singular values *|d _{l}|^{2}* correspond to the eigenvalues of the correlation matrix

where *u*_{1} denotes an eigenvector of length 1.

Here we describe relationships among eigengene-based network concepts if the maximum conformity assumption does not hold (i.e., *a _{e}*

(49)

Observation 2 can be used to derive the following

*If A ^{(q)}=|cor(X^{(q)})|^{β} and the eigengene factorizability is close to 1 (EF(X^{(q)})≈1), the relationships among eigengene-based concepts approximately apply to their network analogs as well.*

For example, we find that

In the following we provide details on our geometric interpretation of the factorizability. To simplify the notation, we sometimes drop the superscript (*q*) in the following expressions. We denote by *θ _{l,i}* the angle between the right singular vector

(50)

Thus, *EF*(*X*^{(q)})≈1 if the module gene expressions *x _{i}* are approximately orthogonal (cos(

Under this assumption, we provide a rough geometric intuition of *a _{ij}*≈

(51)

i.e., the correlation matrix is approximately factorizable.

Here we prove that the eigengene-based heterogeneity increases with the soft threshold *β* (Equation 2). Recall that (Equation 30) which implies that it is a decreasing function of

(52)

Note that *a _{i}*=|cor(

To prove that the heterogeneity increases with *β*, it suffices to prove the following

**Proposition:**
*Let {a _{i}, i=1,…,n} be a group of nonnegative numbers and β>1; then the following inequality holds*:

(53)

To prove the Proposition, we will make use of the following

**Lemma:**
*Let {u _{i}, i=1,…,n} and {v_{i}, i=1,…,n} be groups of nonnegative numbers, and θ be a number 0≤θ<1.*

(54)

The Lemma can be proved with Hölder's inequality, which is given by

(55)

We use the Lemma with *θ*_{1}=*β*/(2*β*−1), *u _{i}*=

Further, we use the Lemma with *θ*_{2}=(2*β*−2)/(2*β*−1), *u _{i}*=

By squaring the first inequality and multiplying it with the second inequality, we arrive at

since 2*θ*_{1}+*θ*_{2}=2 and 3−(2*θ*_{1}+*θ*_{2})=1. The last inequality completes the proof since it is equivalent to the inequality in Equation 53.
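The monotonicity established by the Proposition can also be checked numerically. The sketch below assumes that the eigengene-based heterogeneity is approximately the coefficient of variation of the conformities raised to the power *β*; the conformity values are hypothetical.

```python
import math

def coefficient_of_variation(values):
    """Standard deviation divided by the mean (population form)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return math.sqrt(var) / mean

def eigengene_heterogeneity(conformities, beta):
    """Assumed form: coefficient of variation of a_i^beta."""
    return coefficient_of_variation([a ** beta for a in conformities])

# Hypothetical values of |cor(x_i, E)| for five module genes.
conformities = [0.5, 0.6, 0.7, 0.8, 0.9]
heterogeneities = [eigengene_heterogeneity(conformities, beta) for beta in range(1, 7)]
print(heterogeneities)
```

The printed sequence increases with *β*. When all conformities are equal, the heterogeneity is 0 for every *β*, which is the case of equality in the Proposition.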

Robustness Analysis of the Brain Cancer Gene Coexpression Network. This supporting text provides a detailed analysis of the brain cancer gene coexpression network. The robustness analysis illustrates how the results change with regard to different network construction methods.

(3.83 MB PDF)


Robustness Analysis of the Mouse Gene Coexpression Network. This supporting text provides a detailed analysis of the mouse tissue gene coexpression network. The robustness analysis illustrates how the results change with regard to different network construction methods.

(3.76 MB PDF)


Robustness Analysis of the Yeast Gene Coexpression Network. This supporting text provides a detailed analysis of the yeast cell cycle gene coexpression network. The robustness analysis illustrates how the results change with regard to different network construction methods.

(2.62 MB PDF)

Brain Cancer Network Comprised of 500 Prognostic Genes. Here we analyze a brain cancer network comprised of the 500 genes with highest absolute correlation with brain cancer survival time. The results illustrate that our theoretical results also apply to small networks comprised of sample trait related genes. The robustness analysis illustrates how the results change with regard to different network construction methods.

(0.38 MB PDF)

We are grateful for discussions with Andy Yip, Lora Bagryanova, Dan Geschwind, Peter Langfelder, Tova Fuller, Jake Lusis, Tom Drake, Paul Mischel, Stan Nelson, Mike Oldham, Anja Presson, Lin Wang, and Nan Zhang.

The authors have declared that no competing interests exist.

We acknowledge grant support from 1U19AI063603-01, P50CA092131, 1U24NS043562-01, 5P30CA016042-28, and HL28481.

1. Ravasz E, Somera AL, Mongru DA, Oltvai ZN, Barabasi AL. Hierarchical organization of modularity in metabolic networks. Science. 2002;297:1551–1555. [PubMed]

2. Ihmels J, Bergmann S, Barkai N. Defining transcription modules using large-scale gene expression data. Bioinformatics. 2004;20:1993–2003. [PubMed]

3. Barabasi AL, Oltvai ZN. Network biology: understanding the cell's functional organization. Nat Rev Genet. 2004;5:101–113. [PubMed]

4. Albert R. Scale-free networks in cell biology. J Cell Sci. 2005;118:4947–4957. [PubMed]

5. Milo R, Shen-Orr S, Itzkovitz S, Kashtan N, Chklovskii D, et al. Network motifs: simple building blocks of complex networks. Science. 2002;298:824–827. [PubMed]

6. Resendis-Antonio O, Freyre–Gonzalez JA, Menchaca-Mendez R, Gutierrez-Rios RM, Martinez-Antonio A, et al. Modular analysis of the transcriptional regulatory network of E. coli. Trends Genet. 2005;21:16–20. [PubMed]

7. Balazsi G, Barabasi AL, Oltvai ZN. Topological units of environmental signal processing in the transcriptional regulatory network of Escherichia coli. Proc Natl Acad Sci U S A. 2005;102:7841–7846. [PubMed]

8. Dong J, Horvath S. Understanding network concepts in modules. BMC Syst Biol. 2007;1:24. [PMC free article] [PubMed]

9. Watts DJ, Strogatz SH. Collective dynamics of ‘small-world’ networks. Nature. 1998;393:440–442. [PubMed]

10. Eisen M, Spellman P, Brown P, Botstein D. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci U S A. 1998;95:14863–14868. [PubMed]

11. Ideker T, Ozier O, Schwikowski B, Siegel A. Discovering regulatory and signalling circuits in molecular interaction networks. Bioinformatics. 2002;18:S233–S240. [PubMed]

12. Huang Y, Li H, Hu H, Yan X, Waterman M, et al. Systematic discovery of functional modules and context-specific functional annotation of human genome. Bioinformatics. 2007;23:i222–i229. [PubMed]

13. Butte A, Tamayo P, Slonim D, Golub T, Kohane I. Discovering functional relationships between RNA expression and chemotherapeutic susceptibility using relevance networks. Proc Natl Acad Sci U S A. 2000;97:12182–12186. [PubMed]

14. Zhou X, Kao M, Wong W. Transitive functional annotation by shortest path analysis of gene expression data. Proc Natl Acad Sci U S A. 2002;99:12783–12788. [PubMed]

15. Steffen M, Petti A, Aach J, D'haeseleer P, Church G. Automated modelling of signal transduction networks. BMC Bioinformatics. 2002;3:34. [PMC free article] [PubMed]

16. Stuart JM, Segal E, Koller D, Kim SK. A gene-coexpression network for global discovery of conserved genetic modules. Science. 2003;302:249–255. [PubMed]

17. Carter S, Brechbuler C, Griffin M, Bond A. Gene co-expression network topology provides a framework for molecular characterization of cellular state. Bioinformatics. 2004;20:2242–2250. [PubMed]

18. Bergmann S, Ihmels J, Barkai N. Similarities and differences in genome-wide expression data of six organisms. PLoS Biol. 2004;2:e9. doi:10.1371/journal.pbio.0020009. [PMC free article] [PubMed]

19. Zhang B, Horvath S. A general framework for weighted gene co-expression network analysis. Stat Appl Genet Mol Biol. 2005;4:17. [PubMed]

20. Cabusora L, Sutton E, Fulmer A, Forst C. Differential network expression during drug and stress response. Bioinformatics. 2005;21:2898–2905. [PubMed]

21. Wei H, Persson S, Mehta T, Srinivasasainagendra V, Chen L, et al. Transcriptional coordination of the metabolic network in arabidopsis. Plant Physiol. 2006;142:762–774. [PubMed]

22. Voy BH, Scharff JA, Perkins AD, Saxton AM, Borate B, et al. Extracting gene networks for low-dose radiation using graph theoretical algorithms. PLoS Comput Biol. 2006;2:e89. doi:10.1371/journal.pcbi.0020089. [PubMed]

23. Ghazalpour A, Doss S, Zhang B, Plaisier C, Wang S, et al. Integrating genetics and network analysis to characterize genes related to mouse weight. PloS Genet. 2006;2:8. doi:10.1371/journal.pgen.0020130. [PMC free article] [PubMed]

24. Horvath S, Zhang B, Carlson M, Lu K, Zhu S, et al. Analysis of oncogenic signaling networks in glioblastoma identifies aspm as a novel molecular target. Proc Natl Acad Sci U S A. 2006;103:17402–17407. [PubMed]

25. Oldham M, Horvath S, Geschwind D. Conservation and evolution of gene coexpression networks in human and chimpanzee brains. Proc Natl Acad Sci U S A. 2006;103:17973–17978. [PubMed]

26. Fuller T, Ghazalpour A, Aten J, Drake T, Lusis A, et al. Weighted gene coexpression network analysis strategies applied to mouse weight. Mamm Genome. 2007;18:463–472. [PMC free article] [PubMed]

27. Shen R, Ghosh D, Chinnaiyan A, Meng Z. Eigengene-based linear discriminant model for tumor classification using gene expression microarray data. Bioinformatics. 2006;22:2635–2642. [PubMed]

28. Chuang H, Lee E, Liu Y, Lee D, Ideker T. Network-based classification of breast cancer metastasis. Mol Syst Biol. 2007;3:140. [PMC free article] [PubMed]

29. Albert R, Jeong H, Barabasi AL. Error and attack tolerance of complex networks. Nature. 2000;406:378–382. [PubMed]

30. Jeong H, Mason S, Barabasi A, Oltvai Z. Lethality and centrality in protein networks. Nature. 2001;411:41–42. [PubMed]

31. Albert R, Barabasi A. Statistical mechanics of complex networks. Rev Mod Phys. 2002;74:47–97.

32. Han J, Bertin N, Hao T, Goldberg D, Berriz G, et al. Evidence for dynamically organized modularity in the yeast protein–protein interaction network. Nature. 2004;430:88–93. [PubMed]

33. Carlson M, Zhang B, Fang Z, Mischel P, Horvath S, et al. Gene connectivity, function, and sequence conservation: predictions from modular yeast co-expression networks. BMC Genomics. 2006;7:40. [PMC free article] [PubMed]

34. Almaas E, Kovacs B, Vicsek T, Oltvai ZN, Barabasi AL. Global organization of metabolic fluxes in the bacterium Escherichia coli. Nature. 2004;427:839–843. [PubMed]

35. Snijders T. The degree variance: an index of graph heterogeneity. Soc Networks. 1981;3:163–174.

36. Freeman L. Centrality in social networks: conceptual clarification. Soc Networks. 1978;1:215–239.

37. Ma H, Buer J, Zeng A. Hierarchical structure and modules in the Escherichia coli transcriptional regulatory network revealed by a new top-down approach. BMC Bioinformatics. 2004;5:199. [PMC free article] [PubMed]

38. Watts DJ. A simple model of global cascades on random networks. Proc Natl Acad Sci U S A. 2002;99:5766–5771. [PubMed]

39. Gargalovic P, Imura M, Zhang B, Gharavi N, Clark M, et al. Identification of inflammatory gene modules based on variations of human endothelial cell responses to oxidized lipids. Proc Natl Acad Sci U S A. 2006;103:12741–12746. [PubMed]

40. Langfelder P, Zhang B, Horvath S. Defining clusters from a hierarchical cluster tree: the Dynamic Tree Cut library for R. Bioinformatics. 2007;24:719–720. [PubMed]

41. Ye Y, Godzik A. Comparative analysis of protein domain organization. Genome Biol. 2004;14:343–353. [PubMed]

42. Yip A, Horvath S. Gene network interconnectedness and the generalized topological overlap measure. BMC Bioinformatics. 2007;8:22. [PMC free article] [PubMed]

43. Li A, Horvath S. Network neighborhood analysis with the multi-node topological overlap measure. Bioinformatics. 2007;23:222–231. [PubMed]

44. Alter O, Brown P, Botstein D. Singular value decomposition for genome-wide expression data processing and modelling. Proc Natl Acad Sci U S A. 2000;97:10101–10106. [PubMed]

45. Holter N, Mitra M, Maritan A, Cieplak M, Banavar J, et al. Fundamental patterns underlying gene expression profiles: simplicity from complexity. Proc Natl Acad Sci U S A. 2000;97:8409–8414. [PubMed]

46. West M, Blanchette C, Dressman H, Huang E, Ishida S, et al. Predicting the clinical status of human breast cancer by using gene expression profiles. Proc Natl Acad Sci U S A. 2001;98:11462–11467. [PubMed]

47. Tibshirani R, Hastie T, Narasimhan B, Chu G. Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc Natl Acad Sci U S A. 2002;99:6567–6572. [PubMed]

48. Yeung MS, Tegner J, Collins J. Reverse engineering gene networks using singular value decomposition and robust regression. Proc Natl Acad Sci U S A. 2002;99:6163–6168. [PubMed]

49. Liao J, Boscolo R, Yang Y, Tran L, Sabatti C, et al. Network component analysis: reconstruction of regulatory signals in biological systems. Proc Natl Acad Sci U S A. 2003;100:15522–15527. [PubMed]

50. Adrian D, Chris H, Beatrix J, Joseph R, Guang Y, et al. Sparse graphical models for exploring gene expression data. J Multivar Anal. 2004;90:196–212.

51. Tamayo P, Scanfeld D, Ebert B, Gillette M, Roberts C, et al. Metagene projection for cross-platform, cross-species characterization of global transcriptional states. Proc Natl Acad Sci U S A. 2007;104:5959–5964. [PubMed]

52. Langfelder P, Horvath S. Eigengene networks for studying the relationships between co-expression modules. BMC Syst Biol. 2007;1:54. [PMC free article] [PubMed]

53. Fisher RA. On the ‘probable error’ of a coefficient of correlation deduced from a small sample. Metron. 1915;1:1–32.

54. Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, et al. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell. 1998;9:3273–3297. [PMC free article] [PubMed]

55. D'haeseleer P, Liang S, Somogyi R. Genetic network inference: from co-expression clustering to reverse engineering. Bioinformatics. 2000;16:707–726. [PubMed]

56. Perkins TJ, Jaeger J, Reinitz J, Glass L. Reverse engineering the gap gene network of Drosophila melanogaster. PLoS Comput Biol. 2005;2:e51. doi:10.1371/journal.pcbi.0020051. [PMC free article] [PubMed]

57. Barrett CL, Palsson BO. Iterative reconstruction of transcriptional regulatory networks: an algorithmic approach. PLoS Comput Biol. 2006;2:e52. doi:10.1371/journal.pcbi.0020052. [PubMed]

58. Smith VA, Yu J, Smulders TV, Hartemink AJ, Jarvis ED. Computational inference of neural information flow networks. PLoS Comput Biol. 2006;2:e161. doi:10.1371/journal.pcbi.0020161. [PMC free article] [PubMed]

59. Thakar J, Pilione M, Kirimanjeswara G, Harvill E, Albert R. Modeling systems-level regulation of host immune responses. PLoS Comput Biol. 2007;3:e109. doi:10.1371/journal.pcbi.0030109. [PubMed]

60. Price MN, Dehal PS, Arkin AP. Orthologous transcription factors in bacteria have different functions and regulate different genes. PLoS Comput Biol. 2007;3:e175. doi:10.1371/journal.pcbi.0030175. [PMC free article] [PubMed]

61. Needham C, Bradford J, Bulpitt A, Westhead D. A primer on learning in Bayesian networks for computational biology. PLoS Comput Biol. 2007;3:e129. doi:10.1371/journal.pcbi.0030129. [PMC free article] [PubMed]
