In this section we present steps for conducting a Weighted Correlation Network Analysis (WGCNA) of tumor expression data to identify patient groups that have high, moderate, and low survival. We then present results from applying WGCNA to a breast cancer data set consisting of 26 markers measured on 82 patients. We compare the WGCNA results to a more conventional multimarker analysis approach and then show that the WGCNA results validated in two of three Affymetrix gene expression HG133A array data sets.
Steps for conducting a Weighted Correlation Network Analysis (WGCNA) of patients
Figure 1 Overview for conducting a Weighted Correlation Network Analyses (WGCNA) of patient TMA data (Steps 1-4) and follow up analyses (Steps 5-7). Steps 1-4 are numbered to correspond with the WGCNA methods section in the text. After defining WGCNA and WGCNA* (more ...)
1. Create a patient correlation network from tumor marker expression data
We used WGCNA to identify clusters of patients whose tumor marker profiles were positively correlated. In this analysis, patients are considered "nodes" of the network, and edges between them are determined by correlations across the set of tumor markers. WGCNA was performed using R software functions (indicated in courier font) provided in the WGCNA R package [10
There are two types of weighted correlation networks, "unsigned" and "signed". An unsigned network is based on the absolute value of the Pearson correlation coefficient, while the signed network is based on the conventional Pearson correlation coefficient ("cor"). Specifically, the network adjacency (connection) between a pair of samples xi
is defined as a(i,j)
= (0.5 +
where the power β
facilitates a soft thresholding approach that emphasizes high positive correlations at the expense of low or negative correlations. We recommend a signed correlation network approach for comparing patient expression profiles since it is unlikely that negatively correlated samples are molecularly similar. In practice, we find that patients have moderate to high positive expression correlations across protein markers. Based on the network adjacency matrix, we define the following dissimilarity measure between the samples: dissA
. While other network dissimilarities have been used in correlation network analysis (e.g. the topological overlap based measure [32
]) we recommend dissA
since it leads to clusters of positively correlated samples.
2. Define patient groups (modules) from the patient network
The sample network dissimilarity dissA
can be used as input of a clustering procedure. Here we used average linkage hierarchical clustering using the flashClust WGCNA function. Clusters of patients were defined as branches of the resulting cluster tree. To "cut" the branches of the tree, we used the cutreeDynamic R function since it affords more flexibility than traditional approaches and has been carefully evaluated in several simulation studies where it was shown to retrieve the true simulated module structure [34
]. By construction, the resulting clusters of patients (also referred to as groups or "modules" of patients) have positively correlated expression profiles across the tumor marker set. The molecular profiles of each module can be represented using the first principal component (referred to as an eigensample). The module eigensample (ME) is a vector of length equal to the number of tumor markers, that captures the maximum amount of expression variation in a given module. It can also be interpreted as a weighted average of the expression values across the patients belonging to the module. The MEs of different modules can be correlated with each other to determine whether two highly correlated clusters should be merged. ME's with high correlation can be merged to reduce the patient network to a manageable number of modules using the mergeCloseModules function.
Clinical variables can be visualized using color-bands underneath the dendrogram (cluster tree) to visually evaluate or refine merging parameters using plotDendroAndColors (see Figure for an example). We optimized the module merging process to correspond with patient mortality. As a result, we identified three patient clusters with low, moderate and high mortality rates. Since our optimized module merging process may have overfit the data, we evaluated the prognostic accuracy of the resulting clusters in three independent data sets as described below.
Figure 2 Results of a WGCNA of 82 breast cancer patients and 26 markers. A. Markers were clustered according to their expression levels across patient samples, so that each branch of the tree indicates a patient. The first row of white, grey and black colors below (more ...)
3. Evaluate the utility of WGCNA groups for survival prediction
To understand the clinical meaning of the three patient clusters (referred to as WGCNA groups) we studied the relationships between the groups and clinical variables. Conventional survival analysis methods such as Kaplan-Meier plots and log-rank tests were used to assess survival prediction, and we used a Kruskal-Wallis and Fisher's exact test to relate the WGCNA mortality groups to continuous and categorical clinical variables, respectively. We used a log rank test to confirm that the WGCNA patient groups were highly related to cancer survival, and then checked that these patient groups could not be exclusively defined by clinical variables (stage, grade, Her2+, ER+, PR+, tumor size, lymph node involvement and metastasis). We also evaluated survival prediction of the WGCNA groups while controlling for other predictive clinical variables in a multivariate Cox proportional-hazards model.
4. Use classification trees to identify key markers for defining WGCNA groups
Depending on the number of markers analyzed, it may be practical to reduce the full marker set to a few key markers that would be more manageable in a diagnostic setting or validation analysis. After confirming that the WGCNA patient groups were predictive of survival in both univariate and multivariate analyses, we used recursive partitioning (or classification tree methodology) implemented in the rpart R function to identify a few markers that could approximate our WGCNA patient groups. The resulting approximate patient groups, or "WGCNA*" were related to clinical variables and evaluated in a multivariate Cox proportional-hazards model (as in step 3).
In summary, WGCNA and WGCNA* are both categorical grouping variables that attempt to classify low, moderate and high mortality risk groups according to their TMA marker expression data. The WGCNA variable uses data from the complete set of markers, and the WGCNA* variable approximates the WGCNA categories by identifying optimal thresholds for a small subset of these markers. Since a threshold can be defined in relation to its parent distribution, i.e. as a percentile, the WGCNA* classifier or "rule" can easily be evaluated in additional independent data sets.
Application of WGCNA to a tumor expression array breast cancer data set
We applied the WGCNA methodology outlined above to a high-density TMA platform consisting of 26 putative tumor biomarkers measured on 82 breast cancer patients (Tables , ). The patients were clustered by their expression profiles, which were transformed to adjacencies using the soft threshold (power) β = 10 and the signed network option for calculating adjacency in the WGCNA R package. The cutreeDynamic function was used (with options minClusterSize = 2 and deepSplit = 3) to generate 16 modules (not shown). Since we did not expect that these modules were robustly defined given the relatively small data set, we merged similar modules using the mergeCloseModules R function with the "cutHeight" parameter set to 0.15, which was optimal for obtaining fewer but more robustly defined modules that were significantly associated with survival (p-value = 3.9 × 10-4). Since the groups were defined with respect to the survival outcome, the p-value is overfit and should be interpreted as a descriptive (not inferential) measure. To arrive at an unbiased evaluation of the patient groups, we used independent gene expression data sets as described below. The WGCNA groups corresponded to mortality rates of 5.4%, 22%, and 50% (colored white, grey and black, respectively in Figure , , ).
Summary statistics for trait data on 82 patients
Summary statistics for TMA markers on 82 patients
To test whether the median values of ordinal variables differed between WGCNA patient groups, we used the Kruskal Wallis test, which is a non-parametric multi-group comparison test. Boxplots were used to visualize the distribution for each group. Lymph node involvement, stage, metastasis and estrogen receptor positivity were significant at the 0.05 level, but none of these variables could completely define one or more of our patient groups (data not shown). Furthermore, an ordinal multivariate regression model predicting our patient groups from these four variables resulted in a McFadden pseudo R-square of only 0.068 (SPSS v16.0). After verifying that our WGCNA groups were distinct from our clinical variables, we evaluated its utility for survival prediction in the presence of other predictive variables. Variables that were significant at the 0.05 level in univariate Cox proportional-hazards models included lymph node involvement, stage, metastasis and Her2 positivity. A multivariate Cox proportional-hazards model that included these variables and our WGCNA variable found all predictors to be non-significant (p > 0.05) except the WGCNA mortality groups, where the high mortality group had a p-value of 0.037. These results suggest that the WGCNA mortality groups have distinct molecular characteristics that predict breast cancer survival independently of prognostic clinical variables. However, the WGCNA cluster variable was defined with respect to the 26 markers, which is cumbersome to validate. Therefore, we aimed to develop a simple classification rule (referred to as WGCNA*) which assigns each patient to its respective WGCNA cluster. Toward this end, we used classification trees implemented in the rpart R package, which automatically selects significant markers and corresponding thresholds. The classification tree led to a WGCNA* rule based on three markers p53, Na-KATPase-β1, and TGF β receptor II; with optimal thresholds corresponding to the 75th, 33rd and 66th percentiles, respectively (Figure ). Due to missing data in the p53, Na-KATPase-β1, and TGF β receptor II markers (2, 4 and 8 missing values, respectively), the WGCNA* rule initially resulted in only 66 patients with mortality group assignments. As a result, we confirmed that the missingness pattern of each marker was unrelated to survival and then replaced missing values by the median (rather than the average due to skewed distributions), so that all 82 patients were assigned to a WGCNA* mortality group. By construction, the WGCNA* mortality groups closely matched the original WGCNA groups, differing by only 10 patients (Table ). As a result, the WGCNA* patient groups were highly predictive of patient survival (p-value = 9.1 × 10-5, Table ), with mortality rates of 5.4%, 24%, and 67% (Figure , ). The WGCNA* rule suggests that breast cancer patients with high p53 and low Na-KATPase-β1 have a high risk of death in comparison to other molecular profiles. Furthermore, patients with low p53 and high TGF β receptor II have a moderate mortality risk.
Figure 3 The WGCNA* and COX mortality group definitions. A. Classification trees were used to identify a subset of markers (3 out of 26 total) and their optimal thresholds for approximating the WGCNA groups. Nearly 88% (72 matches out of 82) of the mortality group (more ...)
Comparison of WGCNA* mortality group assignments to WGCNA and COX
Mortality comparison between WGCNA*, WGCNA and COX groups
To elucidate the clinical meaning of the WGCNA* groups, we related them to clinical variables. We found that the WGCNA* groups are significantly related to stage, metastasis and estrogen receptor positivity (p < 0.05, Figure ). A multivariate Cox proportional-hazards model that included the prognostic clinical variables (lymph node involvement, stage, metastasis and Her2+) had a slightly lower R2 (0.306 versus 0.326) and hazard ratio (3.8 versus 5.9) for the high mortality group when WGCNA* was used as a predictor rather than the original WGCNA grouping variable based on all 26 markers (Table ). However, these differences are negligible given that WGCNA* substantially reduced the number of markers needed for validation.
Figure 4 Variable and marker boxplots by WGCNA* mortality group. Kruskal-Wallis p-values are reported for the comparison of each variable and marker to the WGCNA* patient groups, where the WGCNA* patient groups are color coded to indicate low (white), moderate (more ...)
Survival prediction of WGCNA, WGCNA* and COX groups in a multivariate Cox proportional-hazards (CPH) model
We verified that no single marker could define the WGCNA groups (Figure ). Cox proportional-hazards models that included WGCNA* and each marker individually (dichotomized at its optimal survival prediction threshold) found WGCNA* to be the top predictor in all cases (data not shown).
Comparison of WGCNA* patient groups with a conventional step-wise analysis
While WGCNA defines patient mortality groups that predict survival independently of other clinical variables, it is interesting to know how this approach compares with a more conventional stepwise variable selection approach [37
]. We analyzed the same TMA data by first dichotomizing each marker at an optimal threshold for survival prediction. Ten markers could be dichotomized at a level that achieved a minimum of five patients per group and a univariate Cox proportional-hazards model p < 0.05. We included these 10 markers in a multivariate Cox proportional-hazards model and removed the least significant predictor in a step-wise manner, until all remaining variables achieved significance at the 0.05 level. Four markers: MED28 expressed in the cytoplasm, p53, Smad4 expressed in the cytoplasm and Her2 were retained. The resulting model had an R2
of 0.321. We then used classification trees (rpart with complexity parameter 0.1) to identify a subset of markers and their thresholds for defining low, moderate and high mortality patient groups. This was achieved with two markers MED28 and Smad4, dichotomized at their 75th
percentiles, where high MED28 resulted in high mortality and low MED28 in conjunction with high Smad4 resulted in moderate mortality (Figure ). The resulting "COX" patient groups were significantly related to survival (p = 1.6 × 10-4
) and had mortality rates of 4.1%, 33% and 50%. While the COX variable had similar mortality rates to WGCNA and WGCNA*, there was a substantial 35% difference in the assignment of patients to mortality groups (Table , Additional File 1
). A multivariate Cox proportional-hazards model that included the COX variable and the prognostic clinical variables, found the COX variable to be the best predictor with p-values of 0.016 and 0.036 for the moderate and high mortality groups, respectively (Table ). Finally, to directly compare the WGCNA* and COX variables, we included both in a Cox proportional-hazards model with and without the prognostic clinical variables. The WGCNA* high mortality group was the most significant predictor (p = 0.017) in the absence of clinical variable data, but when prognostic clinical variables were included, the COX moderate and high mortality groups were the only significant predictors in the model.
In summary, the WGCNA* and COX mortality groups are distinct from each other and are both important predictors of breast cancer survival in our TMA data. While COX outperforms WGCNA* in the presence of prognostic clinical variables, it was also created by optimizing the significance of its underlying markers in a Cox proportional-hazards model. Thus, the COX variable's superior performance could possibly be explained by over-fitting in our TMA data set and may not validate in other data sets. To test this, we attempted to validate the WGCNA* and COX mortality group rules in independent gene expression data sets.
Validation analysis of WGCNA* and COX groups in gene expression data sets
We applied the WGCNA* and COX rules to three independent Affymetrix HG-U133A data sets (GSE3494, GSE1456, GSE2990) [24
]. After data cleaning (as described in the Methods section) the data sets consisted of 207, 146 and 173 breast cancer patients for the Miller 2005, Pawitan 2005 and Sotiriou 2006 data sets, respectively (Additional File 2A
). The data sets had mortality rates ranging from 17%-27%, with the Pawitan 2005 and Sotiriou 2006 data sets being most similar to the 16% rate in our TMA data. Other than survival, grade was the only clinical variable common to all three data sets, and it appeared fairly consistent across the gene expression and TMA data sets. The probe set distributions were also consistent, where the interquartile ranges overlapped for all Miller 2005 and Pawitan 2005 probe sets; which overlapped with Sotiriou 2006 in most cases (Additional File 2B
). There were no anomalies to disqualify a data set or probe set for validation analysis.
On the Affymetrix HG-U133A array, the protein Na-KATPase-β1 was represented by gene ATP1B1 which had two probe sets 201242_s_at and 201243_s_at. Similarly, p53 corresponded to the TP53 gene (201746_at and 211300_s_at); and TGF β receptor II corresponded to TGFBR2 (207334_s_at and 208944_at). Thus there were eight versions of the WGCNA* rule corresponding to each probe set combination (2 × 2 × 2). Similarly for the COX rule, the proteins MED28 cyt and Smad4 cyt were represented by genes MED28 (214831_at, 218438_s_at) and SMAD4 (202526_at, 202527_s_at) resulting in four versions of the COX rule per validation data set. The strongest WGCNA* validation was achieved in the Pawitan 2005 data set, where all eight probe set combinations validated for the high mortality group (p < 0.1, Table , Figure ). In the Miller 2005 data set, two probe set combinations validated. No probe sets combinations validated in the Sotiriou 2006 data set. Applying the same validation criteria (p < 0.1), the COX rule did not validate in any data set for any probe set combination (data not shown). In summary, the WGCNA* high mortality group validated in two of three Affymetrix breast cancer data sets, suggesting that patients with high TP53 and low ATP1B1 may have a worse prognosis than patients with other profiles. Since high p53/TP53 is a well known indicator of poor prognosis, the following section verifies that the combination of Na-KATPase-β1/ATP1B1 and p53/TP53 in both the protein and gene expression data sets is a stronger survival predictor than p53/TP53 alone.
Validation results for WGCNA* mortality groups in three Affymetrix data sets
Figure 5 Validation of the WGCNA* high mortality group in two independent gene expression data sets (A-B). A. Results for the Miller 2005 data set are shown for the following probe sets ATP1B1: 201242_s_at, TP53: 201746_at, and TGFBR2: 208944_at. The Pawitan 2005 (more ...)
A comparison of p53 and the WGCNA* high mortality group
Since the WGCNA* high mortality group was defined by low Na-KATPase-β1 and high p53, we checked whether p53 alone would be a sufficient or possibly superior survival predictor. In our TMA data, the optimal dichotomized threshold for p53 was the 75th percentile. In a Cox proportional-hazards model that included both the WGCNA* high mortality group (coded as high versus moderate and low combined) and the dichotomized p53 marker, the hazards ratio for the WGCNA* high mortality group was more than two-fold higher at 4.5 (p = 0.07) versus a hazards ratio of 2.1 (p = 0.38) for the dichotomized p53 marker. In the gene expression data, the continuous form of the TP53 variable was not significant while the WGCNA* high mortality group maintained significance at the 0.05 level for all 8 of the Pawitan 2005 models. The dichotomized TP53 marker did achieve significance at the 0.05 level in two of the Miller 2005 models, but high TP53 indicated a protective effect, which is inconsistent with current (protein-level) findings (HR = 0.35 and p = 0.04 for both models). In summary, low Na-KATPase-β1 (< 33rd percentile) in combination with high p53 (>75th percentile) is a stronger predictor of mortality than p53 alone in both our TMA data and the Pawitan 2005 gene expression data set.
Analysis of the WGCNA* high mortality group in the Pawitan 2005 data set
Since the WGCNA* high mortality group consistently validated in the Pawitan 2005 data set, we explored the relationship between this group and the available Pawitan 2005 variables: subtype (Basal, ERBB2, luminal A, luminal B and normal like) and grade (I-III). In this data set, the high mortality group consisted of 11 patients, 7 of which were luminal B, one luminal A, one basal, and two were missing subtypes. Thus, subtype was significantly related to WGCNA* high mortality (Fisher's exact test p = 1.5 × 10-4
). Similarly, seven of the high mortality group patients were grade 3, two were grade 2, and two were grade 1, although this relationship did not achieve significance (Fisher's exact test p = 0.194). In a multivariate Cox proportional-hazards model with subtype coded as luminal B versus other types and grade coded as an ordinal variable, the WGCNA* high mortality group was the strongest predictor with a hazards ratio of 4.22 (95% CI: 1.3, 14.1, p = 0.019). See Additional File 3
for characteristics of the WGCNA* mortality groups.