Robust methods for summarizing large quantities of genomics data can be critical to advancing biological understanding. To address this issue we created collapseRows, which takes a numeric matrix of data as input, in which rows correspond to the variables, as well as a grouping variable that assigns each row to a group. This function implements standard collapsing methods (such as forming the average), as well as alternative network based methods (such as connectivity based collapsing) as possible strategies for data aggregation, then outputs one representative or composite value per group of rows. Thus, the resulting matrix has the same number of columns as the input matrix but typically has far fewer rows, potentially alleviating computational bottlenecks and often increasing between-study comparability. For example, in Figure we present a hypothetical pipeline for using microarrays from multiple data sets to predict clinical outcome, in which collapseRows is used twice: first to collapse probes to genes, and then to collapse genes to modules.
Figure 1 Example pipeline for using collapseRows functions. The collapseRows function could be used in two steps of a pipeline for finding predictors of a clinical outcome. First, probes could be collapsed into genes by taking the probe with the highest expression (more ...)
In the following, we describe the R implementation of collapseRows and present examples from several empirical studies of high dimensional genomics data. In particular, for each of our examples, we run a subset of the following six collapsing strategies: maximum mean (1.max), maximum variance (2.var), maximum mean + maximum connectivity (3.kMax), maximum variance + maximum connectivity (4.kVar), module eigengene (5.ME), and the average (6.Avg; see Methods for details). Table provides an overview of our empirical studies and the types of employed collapsing strategies.
Summary of data sets and corresponding collapsing strategies
R implementation of the function collapseRows
This collapseRows function implements several network-based and biostatistical methods for finding a single value to represent each group specified in the function call. collapseRows is available in the WGCNA package for R [5
], and is presented as Additional file 1
: the collapseRows function, while code for performing all of the above analyses is presented in Additional file 2
: R code to perform analysis. Accompanying data files can be downloaded from the collapseRows website [6
]. Our code was run using version 2.12.2 of R, although any recent version should work. A few steps of the analysis (i.e., building the networks for example 2) require a computer with a large amount of RAM (a computer with 16 GB of RAM was used for these steps); however, most analysis steps can be performed on laptops (e.g. we used a MacBook Pro with 4 GB RAM in most analyses steps), and the entire analysis can be reproduced in a few hours. The R function is called as follows:
collapseRows(datET, rowGroup, rowID, method="MaxMean",
In this call, datET represents a matrix or data frame containing numeric values where rows correspond to variables (e.g. microarray probes) and columns correspond to observations (e.g. microarrays). Each row must have a unique row identifier (specified in the vector rowID). The group label of each row is encoded in the vector rowGroup (e.g. gene symbols corresponding to the probes in rowID). The argument method is a character string for determining which method is used to collapse rows into groups. The implemented options, described in more detail in the Methods, are "MaxMean," "MinMean," "maxRowVariance," "absMaxMean," "absMinMean," "ME", "average," and "function." If method = "function", the method used is set by the argument methodFunction, which must be a function that takes a matrix of numbers as input and produces a vector the length of the number of columns as output (e.g., colMeans). The logical argument connectivityBasedCollapsing controls whether groups with 3 or more corresponding rows should be collapsed by choosing the row with the highest connectivity according to a signed weighted correlation network adjacency matrix (Methods), where the power is determined by the argument connectivityPower. All collapsing strategies presented in the empirical studies were formed by setting a combination of the method and connectivityBasedCollapsing parameters. Finally, the logical argument selectFewestMissing (with default value TRUE), controls whether datET should be trimmed such that only the rows with the fewest number of missing values per group are retained. Detailed information regarding the function is available within R using the help(collapseRows) command.
Example 1: collapsing microarray probes to genes
Gene expression microarrays often have several probes (or probe sets) per gene. Collapsing methods can be used to arrive at a single measurement per gene, which facilitates merging of data across different platforms. For example, in a previous study of mouse and human brain, a preliminary version of collapseRows was used to convert gene expression matrices from 18 human and 20 mouse data sets run on five separate microarray platforms into matrices that could be compared [3
] (see Methods). Here we re-analyze these same data sets, along with several additional data sets from studies of human blood [7
We evaluate the performance of the collapsing methods in multiple ways. To begin with, we determine how reproducible the result of the collapseRows function is across different data sets. For example, assume two different gene expression data sets in brain are available. After collapsing the probes by gene one can calculate the mean expression value of each gene in the two data sets. Next one can correlate the mean expression levels of the first data set with those of the second data set. A high correlation indicates that the output of the collapsing method allows one to robustly define the mean expression level of a gene. Since we are mainly interested in ranked expression levels, we determine the correlation of ranked mean expressions (which amounts to using the Spearman correlation coefficient). Apart from studying the reproducibility of the mean expressions, we also study the reproducibility of the network connectivity since this informs co-expression network applications. The connectivity (or degree) is the most widely used network statistic for describing the topological properties of a network node (gene) [13
]. It is of great practical interest to determine to which extent the connectivity (and hub status) is preserved between 2 data sets. For example, it has often been used to compare networks from different species [3
]. In this case, we define a signed weighted correlation network (Methods) after collapsing the probes into genes. Next we calculate the whole network connectivity measure ki
, which measures how correlated the i-th gene is with other genes in the network (Methods). A significant positive correlation of ranked connectivity suggests that a co-expression network between the genes (collapsed probes) is robustly defined.
To empirically assess the reliability of different strategies of probe selection, we ran the collapseRows function on the brain and blood data sets described above, using 1.max, 2.var, 3.kMax, and 4.kVar. For each pair of gene expression matrices, we took the subset of genes in common and correlated the ranked mean expression and the ranked connectivity between data sets for each of the four collapsing strategies (see Figure for a typical example; while the scatter plots do not reveal a strong relationship, we should point out that the connectivity correlations are significantly different from zero and non-negligible - at p < 10-8 the correlation coefficient is highly significantly different from zero). We then determined the average correlation of these two measures across data set pairs in human brain (Figure ), mouse brain (Figure ), and human blood (Figure ) and assessed how frequently and to what degree these measures were highest for each collapsing strategy. Overall, we found that 1.max was the most robust method, particularly with regards to the expression correlation; in other words, choosing the probe with the highest mean expression nearly always lead to the most comparable results between these data sets. We should point out that numerous alternative approaches for collapsing probes per gene are possible. Some of them are discussed below.
Figure 2 When collapsing probes to genes, 1.max is usually the optimal collapsing strategy to choose. A) A typical example of ranked expression (left column) and ranked connectivity (right column) correlation between two data sets. Each dot represents a gene in (more ...)
Example 2: choosing one centroid per gene expression module
The collapseRows function allows one to aggregate the genes that make up a co-expression module. A co-expression module is sometimes defined via a clustering method, but any module assignment can be specified with the row-grouping variable of the collapseRows function (also see examples 3 and 4 below). We assessed the reliability of the following three different strategies for collapsing module genes into a single representative per module: 1.max, which chooses the module gene with highest mean expression; 3.kMax, which chooses the most highly connected intramodular hub gene; and 5.ME, which defines the first principal component of the module genes, resulting in the module eigengene. We focused only on these three collapsing methods since they are often used in network based data reduction methods [13
To perform this analysis, we started with expression data from a single data set in each group (GSE3790B for human brain, GSE4734 for mouse brain, and [9
] for human blood) and defined co-expression modules using WGCNA, since it is a widely used method for defining co-expression modules [3
]. Modules were defined as branches of a hierarchical clustering tree. To increase the robustness of our results, different choices of branch cutting parameters implemented in the dynamicTreeCut R library [20
] were chosen to define 28 different module assignments. For each choice of module assignments (grouping variables), we then ran collapseRows on the subset of seven human and eight mouse brain data sets run on the HG-U133A and MG-U430A Affymetrix platforms, respectively, and on all five human blood data sets from example 1. Finally, we calculated the resulting ranked mean expression and ranked connectivity correlations between the different data sets as described above (Figure ). If rows represent genes in a module, the connectivity of the collapsed variables measures how closely related the module is to other modules. Several studies have found that these correlations between modules may reveal higher order relationships between them [1
Figure 3 When collapsing genes to modules, the optimal method depends on the goal of the analysis. A-C) Across several studies in human brain (A), mouse brain (B), and human blood (C) the MaxMean parameter generally produces better ranked expression correlations (more ...)
For human brain (Figure ), mouse brain (Figure ), and human blood (Figure ) the ranked mean correlation measure suggests that 1.max is the best collapsing strategy, whereas the ranked connectivity measure suggests that 3.kMax is a better collapsing strategy. Furthermore, in the case of mouse brain, 3.kMax actually leads to higher ranked connectivity correlations than the standard method of calculating the module eigengene (5.ME). Together, these results suggest that different parameter options should be chosen depending on the goal of the analysis and the specific input data being used. For example, if the goal of the analysis were to determine correlation patterns between modules then 3.kMax would be a more appropriate collapsing strategy than 1.max.
Example 3: predicting cell type abundances in mixtures of pure cell lines
Most microarrays measure tissue RNA from tissue homogenate rather than from a pure cell type. It has been recognized that markers for pure cells can sometimes be used to estimate the proportion of pure cells that make up the tissue homogenate. For example, standard expression deconvolution (based on a multivariate regression model) can be used to simultaneously estimate the abundance of multiple cell types [7
]. While using a multivariate linear model may afford greater efficiency in estimating cell type abundances, it may perform poorly if it is incorrectly specified, for example due to omission of an unknown cell type. Our collapseRows based approach to expression deconvolution addresses a more limited, but related question: how well can expression levels of a set of marker genes for a single cell type (group) be used to estimate the relative abundance of that cell type between samples? Thus, the rows of the input matrix correspond to marker genes and the grouping variable assigns markers to their respective cell types. The collapsed variable per group is used as an abundance measure for the pure cell line.
To evaluate the performance of expression deconvolution methods, Abbas and colleagues [7
] created a data set that involved mixing four distinct "pure" cell lines--Raji and IM-9 (from B cells), Jurkat (from T-cells), and THP-1 (from monocytes)--using pre-specified proportions. These known proportions could be used as gold standard measurement for judging the performance of methods that aim to estimate cell type abundances. Included in the expression data are microarrays run on each pure cell line, as well as microarrays run on four mixtures of these cell lines in known proportions, each in triplicate [7
We applied the collapseRows function to the microarray data from the cell mixture study described above. We first chose the top 500 marker genes for each of the four cell lines by finding the genes with the highest fold change enrichment in a given cell line compared with the other three. We then ran the collapseRows function on the four mixtures of cell lines, using these 2000 marker genes as rows and the four cell lines as groups, and using several collapsing methods. Finally, we scaled these collapsed expression matrices to obtain the predicted proportions of each cell type across samples. We should emphasize that a limitation of our deconvolution approach is that it can only estimate the proportion of cells up to a constant. While our prediction is expected to be correlated with the true proportion across samples, it cannot be used to estimate the true proportion. The situation is similar to predicting degrees Fahrenheit based on another temperature scale: 0 degrees Fahrenheit does not correspond to 0 degrees Celsius, but a linear transformation can relate Fahrenheit to Celsius. It is often sufficient to use such an approach in the context of measuring cell type abundance, e.g. when correlating cell type abundances to a clinical outcome, because the correlation measure is scale invariant.
To assess the performance of different marker gene collapsing methods, we correlated the predicted abundances (proportions) of pure cell types with the true proportion determined by the mixture proportions (Figure ). We find that all four tested methods of collapsing genes--1.max, 3.kMax, 5.ME, and 6.Avg--produced significant correlations (R > 0.8), but that collapsing based on maximum connectivity (3.kMax) led to by far the most accurate predictor (R = 0.994). In this application, collapsing rows by choosing the module eigengene (5.ME) leads to the least accurate predictions.
Figure 4 collapseRows accurately predicts the relative quantity of blood cell lines across mixed samples. Four prediction methods were used on data from (Abbas et al 2009), from which both gene expression data and actual blood cell counts were known: A) maximum (more ...)
Example 4: Predicting cell type abundances in peripheral human blood
Finally, to assess whether these methods can accurately predict cell type abundances in a more realistic model, we obtained microarray data from a deconvolution study of peripheral blood both pre- and post-kidney transplant [10
]. Similar to the previous data set, these data include microarrays run on pure cell populations (CD4+ and CD8+ T cells, and CD19+ B cells) as well as on whole blood; however, these arrays were run using actual human blood and the constituent blood cell populations were purified using magnetic beads from the same samples used for whole blood assays. To assess the performance of different marker gene collapsing methods, we correlated the predicted abundances (which up to a constant are proportions) of pure cell types with the true proportion determined by flow cytometry data (described in [10
]). Again, we find high correlation between true and predicted cell proportions for these cell types (Figure ; 0.5 < R < 0.6 in most cases), although unsurprisingly, these real cell subset predictions for whole blood are not as accurate as the highly contrived strategy of using fixed combinations of transformed cell lines by Abbas and colleagues [7
]. We also varied the number of marker genes per cell type used to form predictor. Collapsing on connectivity (3.kMax), ME (5.ME), and average (6.Avg) all lead to comparably good predictors. Furthermore, choosing N~100 marker genes produces the best results, although these three predictors are all fairly robust to the number of marker genes chosen. Choosing the gene with the maximum mean expression (1.max), however, is a much less robust method, and should not be used in this situation.
Figure 5 collapseRows accurately predicts the relative quantity of cell type across samples of whole blood. Using data from a realistic blood model (Grigoryev et al 2010), the 1.max, 3.kMax, 5.ME, and 6.Avg collapseRows aggregation strategies can still predict (more ...)
Data aggregation using collapseRows leads to increased reproducibility
Despite advantages, data aggregation or collapsing will inevitably lead to a loss of information. Therefore, it is important to assess whether the loss of information has an impact on analysis results and if so what the impact looks like. Most of our empirical analyses use reproducibility (as measured by correlation of ranked mean expression and connectivity) as a measure of how much usable information remains in the data after aggregation. We now extend these analyses to assess how selective removal of information using collapseRows affects reproducibility, using the subset of 7 human brain samples run on the HG-U133A array and the subset of 8 samples run on the MG-U430A array. We measured the correlation of ranked mean expression and connectivity for comparisons on the probe level, and compared those with our best result using collapseRows (1.max; see Additional file 3
- Increase in reproducibility using collapseRows). In human we find average expression correlations of R = 0.87 and 0.87, and connectivity correlations of R = 0.23 and 0.33 for the probe-level and gene-level comparisons, respectively. In mouse the comparable correlations are R = 0.93 and 0.93, and R = 0.15 and 0.24 for probe-level and gene-level comparisons. Together, these results suggest that, while present, the loss of information using collapseRows actually leads to increased reproducibility, at least as measured by correlation of ranked connectivity.
The function userListEnrichment facilitates biological interpretation of gene lists
After collapsing gene lists into groups, either for the purpose of finding co-expression modules or for determining marker genes, it is often useful to relate the collapsed data to cell type markers. For example, when using collapseRows to predict cell type abundances in a data set that does not contain control samples from pure cell lines, it would be useful to be able to compare each gene group against lists of known marker genes for the expected cell type before using these groups for deconvolution. Such a biological indicator is particularly useful when studying brain, where it is almost never possible to precisely separate cell types as an experimental control. To address this issue, we have created the function userListEnrichment, which is also available in the WGCNA library [5
]. This function measures enrichment of input lists of genes (i.e., groups found using collapseRows) with respect to pre-defined or user defined collections of brain- and blood-related lists curated from the literature (see the help file for specific references cited). Files containing user-defined lists can also be input as reference lists directly. userListEnrichment measures enrichment using a hypergeometric test, after which significant enrichments are output to a file that can then be used for further biological assessments. While biological assessment of userListEnrichment is beyond the scope of this article, we present a tutorial for how to perform a co-expression analysis using both collapseRows and userListEnrichment on our website: http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork/JMiller/