Correlation networks are increasingly being used in biology to analyze large, high-dimensional data sets. They are constructed on the basis of correlations between quantitative measurements that can be described by an n × m matrix X = [x_il], where the row indices (i = 1, ..., n) correspond to network nodes and the column indices (l = 1, ..., m) correspond to sample measurements:
$$X = [x_{il}] = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} \tag{1}$$
We refer to the i-th row x_i as the i-th node profile across the m sample measurements.
Sometimes a quantitative measure (referred to as a sample trait) is provided for the columns of X. For example, T = (T_1, ..., T_m) could measure survival time, or it could be a binary indicator variable (disease status). Abstractly speaking, we define a sample trait T as a vector with m components that correspond to the columns of the data matrix X. A sample trait can be used to define a node significance measure. For example, a trait-based node significance measure can be defined as the absolute value of the correlation between the i-th node profile x_i and the sample trait T:
$$GS_i = |\mathrm{cor}(x_i, T)| \tag{2}$$
Alternatively, a correlation test p-value [1] or a regression-based p-value for assessing the statistical significance of the relationship between x_i and the sample trait T can be used to define a p-value based node significance measure, for example by defining
$$GS_i = -\log p_i \tag{3}$$
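The correlation-based node significance measure described above (the absolute correlation between each node profile and the sample trait) can be sketched in a few lines. This is a minimal illustration in plain Python rather than the R package's interface; the function names `pearson` and `node_significance` and the toy data are assumptions made for this example.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / sqrt(var_a * var_b)

def node_significance(X, T):
    """Absolute correlation between each row (node profile) of X and the trait T."""
    return [abs(pearson(row, T)) for row in X]

# Hypothetical toy data: 3 nodes (rows) measured in 4 samples (columns)
X = [
    [1.0, 2.0, 3.0, 4.0],    # increases with the trait
    [4.0, 3.0, 2.0, 1.0],    # decreases with the trait
    [1.0, -1.0, -1.0, 1.0],  # uncorrelated with the trait
]
T = [0.0, 1.0, 2.0, 3.0]     # one trait value per sample

GS = node_significance(X, T)
```

Because the absolute value is taken, the first two nodes receive the same significance even though their profiles move in opposite directions relative to the trait.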
The rationale behind correlation network methodology is to use network language to describe the pairwise relationships (correlations) between the rows of X (Equation 1). Although other statistical techniques exist for analyzing correlation matrices, network language is particularly intuitive to biologists and allows for simple social network analogies. Correlation networks can be used to address many analysis goals including the following. First, correlation networks can be used to find clusters (modules) of interconnected nodes. Thus, a network module is a set of rows of X (Equation 1) which are closely connected according to a suitably defined measure of interconnectedness.
A second analysis goal is to summarize the node profiles of a given module by a representative, e.g. a highly connected hub node, which is centrally located in the module. Focusing the analysis on modules or their representatives amounts to a network-based data reduction method. Relating modules instead of nodes to a sample trait can alleviate the multiple testing problem.
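As a rough sketch of how a hub node might be chosen as a module representative, the example below takes the module member with the largest summed connection strength to the other members. It assumes an unsigned weighted adjacency of the form |cor(x_i, x_j)|^beta, which this section does not specify; the value beta = 6, the function names, and the toy profiles are likewise illustrative assumptions.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / sqrt(var_a * var_b)

def adjacency(X, beta=6):
    """Assumed adjacency: |cor|^beta off the diagonal, 0 on the diagonal."""
    n = len(X)
    return [
        [0.0 if i == j else abs(pearson(X[i], X[j])) ** beta for j in range(n)]
        for i in range(n)
    ]

def intramodular_connectivity(A, module):
    """Sum of each member's adjacencies to the other members of the module."""
    return {i: sum(A[i][j] for j in module if j != i) for i in module}

def hub_node(A, module):
    """Index of the module member with the highest intramodular connectivity."""
    k = intramodular_connectivity(A, module)
    return max(k, key=k.get)

# Hypothetical module of 3 nodes measured in 5 samples
X = [
    [1.0, 2.0, 3.0, 5.0, 4.0],
    [1.0, 2.0, 3.0, 4.0, 5.0],  # correlates 0.9 with both other nodes
    [2.0, 1.0, 3.0, 4.0, 5.0],
]
A = adjacency(X)
module = [0, 1, 2]
hub = hub_node(A, module)
```

In this toy module the second profile (index 1) is the hub, since the other two profiles correlate more strongly with it than with each other.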
A third analysis goal is to identify 'significant' modules. Toward this end, a node significance measure can be used to identify modules with high average node significance (referred to as module significance).
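Module significance as described here, the average node significance within a module, can be illustrated as follows. The significance values, the module labels, and the function name `module_significance` are hypothetical and serve only to show the averaging step.

```python
from collections import defaultdict
from statistics import fmean

def module_significance(node_sig, module_labels):
    """Average node significance within each module label."""
    by_module = defaultdict(list)
    for gs, label in zip(node_sig, module_labels):
        by_module[label].append(gs)
    return {label: fmean(values) for label, values in by_module.items()}

# Hypothetical node significance values and module assignments
GS = [0.9, 0.7, 0.8, 0.2, 0.1, 0.3]
labels = ["blue", "blue", "blue", "brown", "brown", "brown"]

MS = module_significance(GS, labels)
top_module = max(MS, key=MS.get)  # module with the highest average significance
```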
A fourth analysis goal is to annotate all network nodes with respect to how close they are to the identified modules. This can be accomplished by defining a fuzzy measure of module memberships that generalizes the binary module membership indicator to a quantitative measure. Fuzzy measures of module membership can be used to identify nodes that lie intermediate between and close to two or more modules.
A fifth analysis goal is to define the network neighborhood of a given seed set of nodes. Intuitively speaking, a neighborhood is composed of nodes that are highly connected to a given set of nodes. Thus, neighborhood analysis facilitates a guilt-by-association screening strategy for finding nodes that interact with a given set of interesting nodes.
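One simple way to make the neighborhood idea concrete is to rank all non-seed nodes by their mean adjacency to the seed nodes and keep the top few. The sketch below does exactly that; it is a deliberately simplified stand-in, not the procedure implemented in the package, and the adjacency values, the function name `neighborhood`, and the `size` parameter are assumptions.

```python
def neighborhood(A, seeds, size):
    """Return the `size` non-seed nodes with the highest mean adjacency to the seeds."""
    seeds = set(seeds)
    scores = {
        i: sum(A[i][s] for s in seeds) / len(seeds)
        for i in range(len(A))
        if i not in seeds
    }
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:size]

# Hypothetical symmetric adjacency matrix for 5 nodes (zero diagonal)
A = [
    [0.0, 0.9, 0.8, 0.1, 0.5],
    [0.9, 0.0, 0.7, 0.2, 0.1],
    [0.8, 0.7, 0.0, 0.3, 0.2],
    [0.1, 0.2, 0.3, 0.0, 0.6],
    [0.5, 0.1, 0.2, 0.6, 0.0],
]

# Nodes 0 and 1 form the seed set; ask for the 2 closest other nodes
neighbors = neighborhood(A, seeds=[0, 1], size=2)
```

Node 2 ranks first because it is strongly connected to both seeds, which is the guilt-by-association logic in miniature.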
A sixth analysis goal is to screen for nodes using criteria based on a node significance measure, on module membership information, on network topological properties (e.g. high connectivity), etc.
A seventh analysis goal is to contrast one network with another network. This differential network analysis can be used to identify changes in connectivity patterns or module structure between different conditions. An eighth analysis goal is to find shared modules between two or more networks (consensus module analysis). Since by definition consensus modules are building blocks in multiple networks, they may represent fundamental structural properties of the network.
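A change in connectivity between two networks can be quantified in several ways. The sketch below uses one simple choice, assumed here for illustration: each node's connectivity (its row sum in the adjacency matrix) is scaled by the maximum connectivity in its network, and the scaled values are subtracted. The function names and adjacency values are hypothetical.

```python
def scaled_connectivity(A):
    """Row sums of the adjacency matrix divided by the largest row sum."""
    k = [sum(row) for row in A]  # assumes a zero diagonal
    k_max = max(k)
    return [v / k_max for v in k]

def differential_connectivity(A1, A2):
    """Difference in scaled connectivity between network 1 and network 2."""
    K1, K2 = scaled_connectivity(A1), scaled_connectivity(A2)
    return [a - b for a, b in zip(K1, K2)]

# Hypothetical adjacency matrices for the same 3 nodes under two conditions
A1 = [
    [0.0, 0.8, 0.8],
    [0.8, 0.0, 0.2],
    [0.8, 0.2, 0.0],
]
A2 = [
    [0.0, 0.1, 0.1],
    [0.1, 0.0, 0.9],
    [0.1, 0.9, 0.0],
]

diff_k = differential_connectivity(A1, A2)
```

A large positive value, as for node 0 here, marks a node that is highly connected in the first network but not in the second.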
The above incomplete enumeration of analysis goals shows that correlation networks can be used as an exploratory data analysis technique (similar to cluster analysis, factor analysis, or other dimension reduction techniques) and as a screening method. For example, correlation networks can be used to screen for modules and intramodular hubs that relate to a sample trait. Correlation networks allow one to generate testable hypotheses that should be validated in independent data or in designed validation experiments.
Gene Co-Expression Networks
In the following, we focus on gene co-expression networks, which represent a major application of correlation network methodology. Co-expression networks have been found useful for describing the pairwise relationships among gene transcripts [2]. In co-expression networks, we refer to nodes as 'genes', to the node profile x_i as the gene expression profile, and to the node significance measure GS_i as the gene significance measure. A glossary of important network-related terms can be found in the table 'Glossary of WGCNA Terminology'. Here we introduce an R software package that summarizes and extends our earlier work on weighted gene co-expression network analysis (WGCNA) [5]. WGCNA has been used to analyze gene expression data from brain cancer [10], yeast cell cycle [13], mouse genetics [14], primate brain tissue [18], diabetes [21], chronic fatigue patients [22] and plants [23]. While these publications have made R software code available in various forms, there is a need for a comprehensive R package that summarizes and standardizes methods and functions. To address this need, we introduce the WGCNA R package, which also includes enhanced and novel functions for co-expression network analysis.
Glossary of WGCNA Terminology.