GWAS results from two MS studies were analyzed to identify modestly associated variants within genes with related biological functions. The first dataset was produced by the International MS Genetics Consortium (IMSGC), and comprises 931 family trios genotyped with the GeneChip®
Human Mapping 500K Array Set (Affymetrix) (9
). Quality control for this dataset included sample genotyping efficiency, assessment of marker heterozygosity and allelic frequency, departure from Hardy–Weinberg equilibrium, gender consistency, reproducibility and population genetic structure. A total of 334 923 SNPs survived the quality control protocol and were tested for association with the trait. As expected, a number of markers in the HLA region were strongly associated with the disease phenotype. In addition, 78 markers outside the HLA region were found to exceed the P
< 1 × 10−4
genome-wide threshold of significance. The second dataset (the GeneMSA study, (3
)] was generated using the Sentrix®
HumanHap550 BeadChip (Illumina). After a similar quality control protocol, 551 642 SNPs were used to conduct an association analysis using the genotypic test in 978 cases and 883 controls (3
). In addition, the association of each individual marker with the disease was tested by fitting a logistic regression genotypic model in which gender, Center of sample origin and HLA-DRB1*1501
status were included as covariates. In the GeneMSA study, 87 SNPs outside of the HLA region exceeded the genome-wide significance threshold of P
< 1 × 10−4
. Although there was no full overlapping of associated markers between the two studies, several genes showed evidence of association in both (3
). A meta-analysis is being conducted and will be reported in the near future.
To carry out the protein interaction network-based pathway analysis (PINBPA), we computed a single P
-value for each gene (the gene-wise P
-value, Fig. ) and overlapped these onto a curated PIN. Many markers map within gene deserts or unannotated genes, and these were excluded from the present analysis. This process resulted in gene-wise P
-values for 14 442 and 17 342 genes for the GeneMSA and IMSGC GWAS, respectively. We next conducted sub-network searches on the two MS GWAS using the Cytoscape plugin jActive modules
. jActive modules
combines the network position and association P
-value of each gene to extract potentially meaningful sub-networks or modules. In addition to searching for significant modules using both datasets together, we also conducted individual searches for each study (data not shown). Although the same basic modules were identified in both strategies, higher scores were obtained when both datasets were used together, suggesting a real power gain when larger datasets are used. To assess the specificity of the modules associated with MS, we also performed equivalent analyses on recent GWAS from other autoimmune diseases (rheumatoid arthritis, RA; Crohn's disease, CD; type 1 diabetes, T1D), neurological diseases (Alzheimer's disease, AD; bipolar disorder, BP) and unrelated diseases (coronary artery disease, CAD; hypertension, HT; T2D) (16
). We observed statistically significant modules in all diseases (Fig. ). Interestingly, the largest number of significant modules was observed in MS, suggesting greater genetic heterogeneity in this disease when compared with others. To test to what extent significant network modules could be obtained by chance, we conducted 10 searches randomizing the P
-values among the same set of genes (those with association P
-values <0.05). With the notable exception of MS, RA and AD, the module scores obtained from the randomized P
-values were equal to or even higher than those obtained using the real P
-values (Fig. A). This observation suggests that many of these modules do not represent bona-fide biological networks and that their high scores may have been obtained by chance. In contrast, significantly fewer modules were identified in the searches based on randomized P
-values for MS, RA and AD suggesting that the significant modules obtained from the real P
-values in these diseases represent biologically meaningful networks. To examine in detail the magnitude of the scores for real and randomized P
-values, we plotted those of the top 20 modules for each set of P
-values for each disease (Fig. B). As expected from the previous analysis in MS, RA and AD, most of the top 20 modules obtained with real P
-values showed higher scores than the average score of the randomized searches. In the case of RA, only the top two modules show significantly higher scores than the average of their randomized searches (Fig. B). Notably, these two (partly overlapping) modules are composed exclusively of HLA genes, in which association with the disease is highly significant. Although the total number of modules obtained using real P
-values in BD and CD do not differ significantly from those obtained with the randomized P
-values (Fig. A), 18 of the top 20 scores were higher for the real P
-values (Fig. B). Altogether, these results suggest that the significant modules found with the original data may represent real effects of interacting proteins on each disease phenotype.
Figure 1. Strategy. A gene-wise P-value for association with MS in two independent studies was computed by selecting the P-value of the most significant marker within each gene. Genes with a P-value less than or equal to 0.05 (red circles) were selected for subsequent (more ...)
Figure 2. Module identification. (A) Number of significant modules (size <50 and score >3) identified by Jactive modules in MS and in a panel of other autoimmune (RA, T1D and CD), other neurological (AD and BD), and other unrelated (T2D, HT and (more ...)
Significant modules for MS
We identified 346 significant modules on the basis of their aggregate degree of genetic association with MS. Due to the nature of the search algorithm, several of these modules overlap extensively in their component genes. Thus, to describe modules representative of association with MS, we selected those with the highest scores which also displayed a minimum degree of overlap (Fig. ). Consistently with all previous genetic studies in MS, the most significant module (MS_I) included several HLA genes (Fig. A). Although the only gene consistently found associated to MS in this region is HLA-DRB1, the module shown lists another gene, HLA-DRA, as its most significant node. It is possible that because HLA-DRB1 is highly polymorphic, most SNP markers included in large-capacity arrays are not targeting this gene directly. Indeed, there are three times as many SNPs in HLA-DRA as there are in HLA-DRB1 in the Illumina 550 k platform and the DRA SNP rs313588 tags with high sensitivity the HLA-DRB1*1501 allele. The observed associations with other HLA genes like HLA-DMA/B and HLA-DOA/B may be also due to the extensive linkage disequilibrium (LD) seen in this region. Interestingly, HLA-DRB5, present in the most significant module, is part of the DR15 haplotype and has been identified as a potential modifier of the disease (17
). The other two HLA genes that are part of the DR15 haplotype (DQA and DQB) are not present in the module shown. Although the P
-values for each of these genes exceed the threshold of significance used for this analysis, they are not part of module MS_I because there is no evidence that they physically interact with any of its components. Not surprisingly, a KEGG pathway search with these genes identified the terms ‘antigen processing’, ‘cell adhesion molecules’ and ‘Immune system’ as the most significantly over-represented, relative to the number of genes in these pathways expected in the module by chance (Table ).
Figure 3. Representative modules for MS. Nodes represent proteins and connections represent physical interactions as determined by the curated human PID reported in Rual et al. (44). The size of each node is proportional to the −log (10) P-value of association (more ...)
Figure B shows another highly significant module characteristic of MS. In this module (MS_II), several HLA genes are also prominent members, but by virtue of highly connected molecules such as CD4, CD82 and ITGB2, a more extensive immune pattern emerged. Interestingly, two non-HLA susceptibility genes previously associated with MS (IL2Ra and CD58) also appear in this module. We hypothesize that in addition to its own significance, the presence of IL2Ra in this module may result from its physical interaction with STAT3 and ITGB2, which themselves show modest association with MS. CD58 was initially identified as a susceptibility gene in the IMSGC study (9
) and its expression was recently found to be upregulated in peripheral blood cells during disease relapses (19
). In contrast, IL7Ra, another gene recently identified as a susceptibility factor in MS (20
) is not part of this module. We speculate that it may act through an independent pathway. Although several of the other immune-related genes in this module have not been formally associated with MS, their involvement in disease pathology seems plausible. These include several cell adhesion (ITGB2, ITGA4, ITGA6, ITGAM and ICAM1) and signaling molecules (TGFBR2, TNFRSF10A and STAT3). Notably, ITGAM
(CD11b) has been recently associated with susceptibility to systemic lupus erythematosus, another autoimmune disease (21
). KEGG pathways analysis with genes from module MS_II revealed statistically significant over-representation of the processes of cell adhesion, leukocyte transendothelial migration and antigen processing (Table ).
Interestingly, the other two modules characteristic of MS (MS_III and MS_IV) suggest a neural component in the susceptibility to the disease. Module MS_III is highly enriched with genes typically expressed in neurons and glia (NCK2, EPHA3, EPHA4, FYN, EFNB1, EFNB2 and EPHB2). Similarly, module MS_IV includes seven glutamate receptors (GluRs) (GRIK1, GRIK2, GRIK4, GRIA1, GRIA4, GRIN2A and GRID2) in addition to HOMER1, DLG1 and DLG2. HOMER1 regulates group 1 metabotropic GluR function, and DLG1 and DLG2 interact at postsynaptic sites to form a multimeric scaffold for the clustering of receptors, ion channels and associated signaling proteins. The identification of the latter two modules in MS suggests for the first time that modestly significant associations in genes involved in neural pathways may contribute to the overall susceptibility to this disease. Indeed, when members of these modules were tested for membership to KEGG pathways, highly significant enrichment in axon guidance pathways (module MS_III) and long-term depression and potentiation pathways (module MS_IV) were detected (Table ).
As a control for our interpretation of these genes in MS, we next conducted similar analyses on the modules identified for other diseases. Interestingly, for two of the three autoimmune diseases tested (RA and T1D), the most significant modules were exclusively composed of HLA genes (Fig. ). On the other hand, only genes involved in the JAK-STAT signaling pathway (GRB2, JAK1, STAT3 and IFNAR1), and extracellular matrix-receptor interactions (CD44, COL4A2, COL1A1 and FN1), but not HLA were identified in the third autoimmune disease (CD). The two genes most robustly associated with CD (NOD2 and IL23R) are not part of the selected module. As described for module MS_II, this may be due to the fact that evidence for the interaction between these two genes and the rest of the genes in the module is lacking. As expected, the great majority of pathways identified in the significant modules for AD and BD were neural (Development, Parkinson's disease and long-term depression). Table shows the genes and pathways contained in the statistically significant modules identified for RA, T1D, AD and BD.
Representative modules for other diseases. Same conventions as in Figure . (A) RA; (B) T1D; (C) CD; (D) T2D; (E) CAD; (F) HT; (G) AD; (H) BD.
Significant modules for other autoimmune and neurological diseases
Although possibly representing false discoveries, the top modules identified for T2D, CAD and HT are also shown for comparison (Fig. B). In T2D, the most significant module contained genes involved in intracellular signaling (EGFR and BCR), apoptosis (IGF1R, AVEN and APAF1) and insulin receptor signaling pathway (IGF1R and IGF2). In HT, the top scoring module listed genes are almost exclusively involved in cell communication (EGFR, VAV3 and RAC1).
To assess module specificity, we compared the performance of each of them in the disease in which they were identified against its performance across all other diseases. This was accomplished by tabulating the gene-wise P-value of association of each gene in the module with every disease. If a module reached significance just because it was composed of large-sized genes, for which relatively low P-values could be obtained by chance, it would be expected that the same module be also significant in several or all other diseases, but modules are significant only in the disease in which they were originally identified, suggesting they were identified because they were disease-specific and not due to chance. As demonstrated in Figure , the four most significant modules identified in MS show almost no association with any other disease. However, there are some genes that show strong association with other diseases in addition to MS. In particular, the HLA genes also show highly significant associations with both RA and T1D, two autoimmune diseases. Most notably, several significant genes from modules in AD and BD are also significant in MS. For example, SNCA, CDC42EP3, FHL2 and CRMP1 all show P-values <1 × 10−3 in MS and AD or BD, but consistently higher P-values for all the non-neurological diseases. The maximum number of SNPs tested in these genes ranged from 49 (CRMP1) to 79 (CDC42EP3), slightly above the median number of 40 SNPs/gene across the Illumina 550 k array. In contrast, genes such as PARK2, VAV3, PAK7 and NTRK3 yielded relatively low P-values (P < 1 × 10−3) across most or all diseases, possibly because a larger number of SNPs were tested for these genes, and some achieved significance by chance. Indeed, the number of SNPs for these genes in the Illumina platform ranges from 149 (PAK7) to 455 (PARK2).
Figure 5. Module specificity. The P-values of genes from the representative modules shown in Figures and are displayed as a heatmap. Each row corresponds to a single gene. Genes are organized by their membership of modules. Genes corresponding (more ...)
Finally, we identified the 400 genes in which the gene-wise P
-values varied most widely across diseases, and performed one-way hierarchical clustering on these P
-values to produce a dendrogram identifying diseases with similar patterns of genetic association (Fig. ). The two MS and the two RA studies clustered almost perfectly with each other and they, in turn, were grouped together in a looser cluster which also included T1D (autoimmune) and AD (neurological), but did not include any of the unrelated diseases. Intriguingly, the study on CD did not cluster with MS and RA, but with the unrelated diseases. One possible explanation for this difference is the lack of a strong association with HLA genes in CD compared with the other autoimmune diseases. Instead, genetic susceptibility to CD appears to be more widely spread across the genome (16
Figure 6. Disease hierarchical tree. The P-values of the genes showing the most variable levels of association across diseases were selected to cluster the GWAS studies. Studies connected by the same branch of the dendrogram are more similar to each other than (more ...)