Standard gene expression analysis based on peripheral blood
We collected blood data from three independent patient sets (Table ). Controls were matched for age and gender, with mean age between 62 and 65 at time of blood collection. We obtained a discovery dataset of 30 patients and 30 controls (dataset 1), a replication dataset of same size (dataset 2) and a third dataset with 63 cases and 63 controls (dataset 3) amounting to a total of 123 ALS patients and 123 controls. Datasets 1 and 2 have similar proportions male/females and spinal/bulbar onset in patients. Dataset 3 has more male subjects (60%) and more spinal onset patients (78%).
Clinical information on ALS patients and controls.
Using a Student t-test for comparing gene expression between ALS cases and healthy controls, we calculated a statistical significance level (p-value) for each of the 24365 probes present in the discovery set. At a false discovery rate of 0.05, 2300 probes are differentially expressed between ALS cases and controls. In Additional File 1
, we report the fold change, the p-value, and the Benjamini Hochberg correction for each of these differentially expressed probes. A drawback of a standard analysis is that it ignores the strong correlation patterns between probes, which may lead to an erroneous estimate of the false discovery rate. But a more serious drawback of the standard analysis is that it fails to see the forest for the trees. Below, we will show that two large clusters of genes (modules) relate to ALS disease status. These modules turn out to be highly enriched with known gene ontologies which provides insights into the pathogenesis of ALS.
To predict ALS status based on the gene expression profiles, we used two alternative prediction methods: a random forest predictor and a k-nearest neighbor (with k = 10). Unbiased test set estimates of the prediction accuracy show that both predictors classify 80% of the samples correctly. While the accuracy is relatively high (and reflects the fact that ALS cases are molecularly distinct from controls), it is not clear whether a molecularly predictor of ALS (versus healthy controls) would be clinically relevant. A more basic research question is to identify disease related pathways and gene networks since this may lead to insights regarding the disease etiology and possible treatment regiments. Gene co-expression network methods have been successfully applied in a variety of different settings [28
Weighted gene co-expression network analysis
Here we used weighted gene co-expression network analysis (WGCNA) [42
] in a first attempt to identify ALS associated coexpression modules and their key constituents. WGCNA starts from the level of thousands of genes, identifies modules of co-expressed genes, and relates these modules to clinical variables and gene ontology information. Because gene modules may correspond to biological pathways, focusing the analysis on modules (and their highly connected intramodular hub genes) amounts to a biologically meaningful data reduction scheme.
Highly correlated module genes are represented and summarized by their first principal component (which is referred to as the module eigengene [46
]). The module eigengenes are used to define measures of module membership which quantify how close a gene is to a given module. Module membership measures allow one to annotate all genes on the array and to screen for disease related intramodular hub genes [44
]. As described below, we use functional enrichment analysis with regard to known gene ontologies to understand the biological significance of module genes and to identify putative disease pathways.
Detection of co-expression modules related to ALS
We applied WGCNA to probes with a significant mean detection level (p < 0.05). Hierarchical clustering applied to the discovery set (data set 1) led to the identification of five co-expression modules ranging in size from 199 to 842 genes (Figure ). As can be seen from the color-band underneath the cluster tree, modules correspond to branches and are color-coded (Blue, Yellow, Turquoise, Red and Green module). Grey is used to color background genes that are not grouped into any module.
Figure 1 Gene co-expression modules in human whole blood. Detection of gene co-expression modules in human whole blood datasets comprised of ALS patients and matched controls. (a) Branches of the cluster dendrogram of the most connected genes gave rise to five (more ...)
To assess the robustness of the co-expression module definition, we replicated module detection in the second and third dataset (Additional File 2
) where we colored the genes according to the module color in data set 1. The fact that genes of the same color stay close together in the three different cluster trees (Additional File 2
) shows that the Blue and Yellow module are highly preserved across the three data sets.
Differentially expressed genes tend to be in the Blue or Yellow module
Our module definition was solely based on the gene expression levels in peripheral blood and ignored ALS disease status. To incorporate a clinical outcome into the network analysis, WGCNA makes use of suitably defined gene significance measure. Here we defined the gene significance measure as the Student t-test statistic for testing differential expression between cases and controls. Thus, a large absolute value of the gene significance measure corresponds to a small 2-sided p-value. We found that the gene significance measures in the three independent data were highly correlated (Figure : r = 0.73 for dataset 1 vs. dataset 2 with p < 2.2 × 10-16 and r = 0.71 for dataset 1 vs. dataset 3 with p < 2.2 × 10-16). Thus, the gene significance measures is highly reproducible across the three data sets. Figure shows that genes that are consistently up-regulated in ALS cases tend to be part of the Blue module while genes that are consistently down-regulated tend to be in the Yellow module.
Figure 2 Scatterplots showing strong preservation of gene significance across the three independent datasets. The scatterplots include the network genes colored by their module assignment in dataset 1. t-test statistic value for dataset 1 (x-axis) was compared (more ...)
Two co-expression modules are significantly associated with ALS
We defined a measure of module significance as average absolute gene significance across the module genes. Figure shows that the Blue and Yellow module genes were highly enriched with differentially expressed genes in the 3 independent data sets. The module significance (mean absolute Student t-test statistic) of the Blue module corresponds to p-values 0.0006, 0.005 and 0.0007 in datasets 1, 2, and 3, respectively. The module significance of the Yellow module genes corresponds to p values 0.006, 0.01 and 3.5 × 10-7 in datasets 1, 2 and 3, respectively. Combining all datasets resulted in highly significant p values: 3.0 × 10-6 for the Blue module and 1.3 × 10-8 for the Yellow module.
An alternative and statistically preferable way of relating a module to ALS disease status is to correlate disease status with a suitably defined module representative. As module representative, we used the module eigengene (ME) which is defined as the first principal component of the module expression profiles. The correlations between ALS status and the module eigengene of the Blue module (referred to as MEblue) were r = 0.48 (p = 3.6 × 10-5), r = 0.37 (p = 2.6 × 10-3), and r = 0.41 (p = 6.2 × 10-7) in data sets 1, 2, and 3, respectively. Note that the p-values remain highly significant even after using the most stringent multiple comparison correction (Bonferroni correction) since only 5 comparisons (corresponding to 5 modules) were carried out. This illustrates a statistical advantage of WGCNA: instead of correcting the analysis for tens of thousands of gene comparisons, a module-based analysis involves orders of magnitudes fewer comparisons. For the Yellow module eigengene MEyellow we found highly significant negative correlations with ALS status: r = -0.61 (p = 6.2 × 10-9), -0.50 (p = 1.3 × 10-3), -0.61 (p < 1.0 × 10-22) in data sets 1, 2, and 3, respectively. The negative correlations reflect that most Yellow module genes were under-expressed in ALS patients.
No significant relationship between modules and other clinical variables
We related the module eigengenes to other clinical variables but did not find any other significant associations. In particular, MEblue and MEyellow were not significantly associated with age at time of collection, gender, specific characteristics of ALS patients such as bulbar or spinal onset, age at onset or El Escorial criteria at time of collection. A multivariate Cox regression analysis that regressed survival time on the module eigengenes, site of onset, sex and age at onset, resulted in no significant p-values for any of the covariates.
Using module membership values to annotate the genes with regard to module membership in the data sets
As detailed in the Methods section, we made use of a fuzzy measure of module membership (MM) that can be defined for each module. The module membership measure with regard to the Blue module MMblue(i) = Cor(xi, MEblue) is defined as the correlation between the i-th gene expression profile xi and the Blue module eigengene. Large absolute values of MMBlue(i) indicate that the gene is close to (or part of) the Blue module. In contrast, if MMBlue(i) is 0, then the ith gene is uncorrelated with the Blue module eigengene and is unlikely to be part of the Blue module. The sign of module membership encodes whether the gene has a positive or a negative relationship with regard to the Blue module expression profiles.
We also used a correlation test to compute a corresponding p-value of module membership. We found that the module membership measures of the Blue and Yellow modules are highly preserved across the three data sets as can be seen from Additional File 3
In Additional File 4
, we report the individual module membership values with regard to the different modules in each of the data sets and the mean module membership values across the three independent data sets is referred to as MeanMMblue
The WGCNA R package also computes a gene selection score (referred to as p.weighted) based on gene significance and module membership [45
]. Analogous to a p-value, the smaller the value of p.weighted the stronger is the evidence that the gene is a disease related hub gene.
Ingenuity pathway analysis of four top 500 gene lists
As detailed in the Methods section, we used the module membership values to define four different gene lists. The first and second gene lists consisted of the top 500 genes closest to the Yellow and the Blue module, respectively. The third gene list consisted of 500 genes with the lowest WGCNA gene selection score p.weighted. For comparison with a standard differential network analysis, we also drafted a fourth list of genes according to the average Student t of differential expression across the three data sets.
In the Methods section and Additional File 4
, we provide details on these gene lists including p-values and local false discovery rates (q-values). We used Ingenuity Pathways Analysis (IPA, http://www.ingenuity.com
) to test for enrichment with regard to known gene ontologies. A detailed comparative functional enrichment study of the four lists is presented in Additional File 5
and a condensed version involving only the first 3 lists can be found in Additional File 6
We find that a standard differential analysis (black horizontal bars in Additional File 5
) leads to less significant findings than those of a module based analysis (see categories cellular compromise, infectious disease, genetic disorder, skeletal/muscular disorder, dermatological diseases, connective tissue disorder). This provides indirect evidence that a module centric analysis of these data leads to more significant biological findings.
Detailed enrichment analysis results for the Blue module
Here we provide a detailed description of the most important functional enrichment of the 500 genes with highest module membership value in the blue module.
Post-Translational Modification was the most significant category with p-values ranging between 4.4 × 10-4 and 4.5 × 10-2. Specifically, the following genes are involved in this category: BTG1, BMI1, CAND1, CD47, CD48, CHUK, CLK1, CLK4, CUL2, DNAJA2, DPM1, DUSP11, DUSP12, ELF1, HDAC2, HSP90AA1, MAP3K7, NAE1, PCMT1, PCNP, PPP1CB, PPP1CC, PPP2CA, PPP3CB, PTPN11, RB1CC1, SET, SH2D1A, SIAH1, SIRT1, SLC35A1, SUZ12, UBA3, UBE2N, ZDHHC17. This category included several subcategories, including modification of protein (p = 4.4 × 10-4), neddylation of protein (5.9 × 10-4), refolding of protein (8.3 × 10-3), tyrosine dephosphorylation of protein (8.3 × 10-3), moeity attachment of protein (1.1 × 10-2), deacetylation of protein (1.5 × 10-2), acetylation of protein (4.5 × 10-2) and methylation of amino acids (4.5 × 10-2).
Infection Mechanism was also a highly significant category with p-value range: 5.9 × 10-4 – 4.8 × 10-2. In particular, the genes ATG5, BNIP2, CD46, CHUK, DEK, NGLY1, PRNP, RAB11A, SFRS1, TBK1, TFRC, UBP1, WASL (includes EG:8976), WIPF1 and XPO1 were involved in Infection Mechanism. This category included mobility of vaccinia virus (5.9 × 10-4), replication of virus (1.9 × 10-2), binding of virus (4.6 × 10-2), infection of Influenza virus (2.4 × 10-2), penetration of human herpesvirus 6A (2.4 × 10-2) and penetration of human herpesvirus 6B (2.4 × 10-2).
RNA Post-Transcriptional Modification (p-value range 9.5 × 10-4-2.8 × 10-2) included the following genes: DCP2, DHX15, DNAJB11, DUSP11, EIF4A2, HNRNPH3, IVNS1ABP, NCBP2, PRNP, RNGTT, SFRS1, SFRS2, SFRS3, SFRS6 and SFRS7. This category had subcategories including modification of RNA (9.5 × 10-4), splicing of RNA (1.7 × 10-3), processing of RNA (3.4 × 10-3), decapping of mRNA (2.4 × 10-2), dimerization of tRNA-Lys (2.4 × 10-2) and selection of splice site (2.8 × 10-2)
Neurological Disorder (8.5 × 10-3-4.8 × 10-2) involved the following genes: ACADM, AP1S2, ATP6AP2, B2M, CAB39, CDKN1B, CHMP2B, CRBN, DDX1, EIF3E, GALC, GHITM, GNA13, HMGB2, HMGCR, HSP90AA1, IFNGR1, IMPA1, ITPR1, IVNS1ABP, L1CAM, NDUFB5, NFE2L2, OSBPL8, PCMT1, PPP1CB, PPP1R2, PRNP, PTGS2, RAB1A, RAB11A, RAB3GAP2, RAB5A, SLC9A6 TOMM20 and TRAM1. This category contained subcategories Huntington's disease (1.6 × 10-2), atrophy of dendrites (2.4 × 10-2), pseudobulbar paralysis (2.4 × 10-2) and degeneration of myelin figure (4.8 × 10-2).
Detailed enrichment analysis results for the Yellow module
Here we provide a detailed description of the most important functional enrichment of the 500 genes with highest module membership value in the yellow module.
Genetic Disorder was one of the most significant categories with p-values ranging between 8.3 × 10-7 and 4.7 × 10-2. Specifically, the following genes are involved in this category: ACADVL, ADA, ALG1, ANAPC2, ATXN2, CD4, CDK9, CDK2AP2, COG1, DPAGT1, FLNA, GSS, GUSB, IDUA, IKBKB, IL2RG, ITGAL, MAP2K5, MAP4K1, MCM5, NOD1, NDUFS7, NDUFS8, NDUFV1, SDHA, SIGIR, SLC35C1, SMPD1 and USP11. This category included several subcategories, including psoriatic arthritis (p = 8.3 × 10-7), congenital disorders of glycosylation (2.0 × 10-4), Leigh syndrome (2.2 × 10-4), and spinocerebellar ataxia, type 2 (2.7 × 10-2).
Neurological Disease was another significant category with p-value range: 2.0 × 10-5 – 2.7 × 10-2. In particular, the following genes CLCN7, CLN3, DIAPH1, HSD17B10, HSP90AB1, HD, NDUFS7, NDUFS8, NDUFV1, NFKB2, SDHA and VAC14 were involved in Neurological Disease. This category included neurodegeneration of nervous system (2.0 × 10-5), neurodegeneration of central nervous system (2.7 × 10-2), neurodegeneration of peripheral nervous system (2.7 × 10-2), Leigh syndrome (2.2 × 10-4), Huntington's disease of nervous syndrome (2.7 × 10-2), oxidative stress response (2.7 × 10-2), fragmentation of striatal neurons (2.7 × 10-2) and spinocerebellar ataxia, type 2 (2.7 × 10-2).
Inflammatory Disease (p-value range 8.3 × 10-7-4.0 × 10-2) included the following genes: ADA, ANAPC2, CDK9, CDK2AP2, IKBKB, ITGAL, MAP2K5, MAP4K1, MCM5, NOD1, POLD1, SIGIRR and USP11. This category had subcategories including psoriatic arthritis (8.3 × 10-7), acute pancreatitis (2.7 × 10-2), and keratitis (4.0 × 10-2).
Skeletal and Muscular Disorder (8.3 × 10-7-2.70 × 10-2) involved the following genes: ANAPC2, BIN1, CDK9, CDK2AP2, FLNA, IKBKB, ITGAL, MAP2K5, MAP4K1, MCM5, NOD1, SIGIRR and USI. This category contained subcategories psoriatic arthritis (8.3 × 10-7), Melnick-Needles syndrome (2.7 × 10-2), disorganization of myofibrils (2.70 × 10-2), otopalatodigital syndrome (2.7 × 10-2).
Detailed Ingenuity analysis of the top 500 genes selected by WGCNA
The third list of 500 genes with lowest WGCNA score p.weighted are comprised of genes that are highly related to ALS status and are highly connected intramodular hub genes in the Blue and/or the Yellow module. An Ingenuity analysis of these top 500 genes revealed the following most significant categories: Cellular Compromise (8.5 × 10-5 – 2.8 × 10-2) and Post-Translational Modification (9.8 × 10-5 – 5.0 × 10-2). Cellular Compromise included genes ABCA1, ARHGDIA, CD46, CD55, EXOC7, HMGB2, HD, NFE2L2, PKN1, PLEC1, TBPL1 and VPS26A. This category included degeneration of epithelial cells (8.45 × 10-5) and damage of neuromusclar junctions (2.8 × 10-2).
Post-Transcriptional Modification included 57 genes, and the following subcategories: modification of protein (9.8 × 10-5), modification of amino acids (4.7 × 10-4), modification of L-proline (4.5 × 10-3), modification of L-amino acid (4.8 × 10-3), moeity attachment of amino acids (9.7 × 10-4), moeity attachment of L-amino acid (6.40 × 10-3), hydroxylation of L-amino acid (4.5 × 10-3).
We also applied to the biomarker search option of Ingenuity to these top 500 genes.
The results can be found in Additional File 7
which reports i) tissues where matching genes have been found to be expressed, and ii) known drugs for matching genes. We report these preliminary results to illustrate how WGCNA coupled with Ingenuity Analysis can be used for generating hypotheses that may form the starting point of future studies.
Functional enrichment analysis with DAVID of the top 100 disease related intramodular hub genes
We also carried out a functional enrichment analysis with the data base DAVID [49
]. Here we selected the top 100 most highly connected genes in both modules. Additional File 8
reports highly significant gene ontology categories and representative genes.
The most significant pathway is the Huntington's Disease pathway (Fisher's exact test p = 8.0 × 10-5). Other interesting significant pathways include mRNA processing, the neurodegenerative disorders pathway (p = 0.024), the axon guidance pathway (p = 0.025), and the phosphatidylinositol signaling system (p = 0.038).
Figures and visualizes the connectivity patterns of the 100 intramodular hub genes in the Yellow and Blue modules, respectively. Edges between the intramodular hub genes indicate significant correlations. Both modules contain hub genes involved in apoptosis and protein ubiquitination. Several hub genes in the Blue module are known to be involved in response to stress and vesicle transport. Several hub genes in the Yellow module play an important role in mitochondrial functioning [50
Network of hub genes in the Yellow module colored by gene ontology functional information. Hub genes are connected by an edge if the correlation between their expression profiles is significant.
Network of hub genes in the Blue module colored by gene ontology functional information. Hub genes are connected by an edge if the correlation between their expression profiles is significant.