Partially powered GWAS and the ensuing meta-analysis have identified a number of non-HLA candidate genes associated with MS susceptibility [11]. Each significant association has a very modest effect, representing a small share of the genetic variance affecting disease risk. In this follow-up study of the meta-analysis dataset, we applied logistic regression stepwise selection methods and identified 350 variants. We used these markers to build a genetic profile associated with the cumulative genetic risk, measured as the probability of an individual being an MS case. In the validation dataset, the classification algorithm yielded 62.3% sensitivity and 75.9% specificity, with an AUC of 0.769. Together, these numbers indicate that the genetic profile built using the meta-analysis discovery dataset does not provide high discriminatory accuracy in the independent dataset, despite a median cumulative genetic risk in the discovery dataset of 0.90 for the case group and 0.01 for the control group. In the validation dataset, the corresponding medians are 0.59 for the case group and 0.32 for the control group.
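To make the quantities reported above concrete, the following is a minimal sketch of how a cumulative genetic risk (the probability of being a case) and the resulting sensitivity, specificity, and AUC can be computed from a fitted logistic model. It is an illustration only, not the analysis pipeline used in this study: the function names, the coefficients, and the 0.5 classification threshold are all assumptions introduced for the example.

```python
import math


def p_hat(intercept, betas, genotypes):
    """Logistic-model probability of being a case for one individual.

    `betas` and `genotypes` are hypothetical per-marker coefficients and
    allele counts; they are not taken from the study.
    """
    linear = intercept + sum(b * g for b, g in zip(betas, genotypes))
    return 1.0 / (1.0 + math.exp(-linear))


def sensitivity_specificity(probs, labels, threshold=0.5):
    """Classify as case when P-Hat >= threshold (threshold is an assumption)."""
    tp = sum(1 for p, y in zip(probs, labels) if y == 1 and p >= threshold)
    fn = sum(1 for p, y in zip(probs, labels) if y == 1 and p < threshold)
    tn = sum(1 for p, y in zip(probs, labels) if y == 0 and p < threshold)
    fp = sum(1 for p, y in zip(probs, labels) if y == 0 and p >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def auc(probs, labels):
    """AUC as the fraction of case-control pairs ranked correctly (ties count 0.5)."""
    cases = [p for p, y in zip(probs, labels) if y == 1]
    controls = [p for p, y in zip(probs, labels) if y == 0]
    wins = sum(1.0 if c > k else 0.5 if c == k else 0.0
               for c in cases for k in controls)
    return wins / (len(cases) * len(controls))
```

The AUC does not depend on any threshold, whereas sensitivity and specificity do, so the same set of P-Hat values can give different sensitivity/specificity pairs under different cutoffs.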
To better understand the magnitude of variance explained by different sets of genes in the logistic regression models, we compared the adjusted R2 (Nagelkerke's R2) of different models using the discovery and validation datasets (summarized in Table ). This analysis assigns to the HLA-DRB1*15:01 allele approximately 7% of the total variance in the predictive model. The 11 validated genes explain about 3% of the remaining variance in the discovery dataset and 2% in the validation dataset. For the 350-gene set, the 349 genes in addition to HLA-DRB1 in the model explain 49% and 17% of the total variance in the discovery and validation datasets, respectively. The estimated cumulative genetic risk in the validation dataset using the 12 validated genes did not show significant differences between the case and control groups (Figure 3). On the other hand, the 350-gene set improved classification sensitivity, from 54.3% (12 genes) to 62.3% (350 genes), in the validation process (Table ). Furthermore, when using only the 12 genes, all DRB1*15:01-negative individuals in the validation dataset were classified as controls, which explains the higher specificity observed in the 12-gene-set models and their lack of discriminatory power for DRB1*15:01-negative individuals. Finally, the 350-gene set includes 6 markers in the MHC region other than DRB1, and these are associated with the largest observed P-values. To assess whether they play a surrogate role when calculating the cumulative genetic risk (P-Hat) in the genetic profile, we used logistic regression conditioned on DRB1*15:01 (+/-) to assess the R2 of the six MHC variants. The total variance accounted for by these non-DRB1 MHC genes is 2.1% in the discovery dataset and 2.6% in the replication dataset.
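For reference, the standard definition of Nagelkerke's R2 is reproduced here. The symbols are introduced for this illustration and do not appear in the original text: L_0 is the likelihood of the intercept-only model, L_M the likelihood of the fitted model, and n the number of individuals. The statistic rescales the Cox-Snell R2 so that its maximum attainable value is 1.

```latex
R^2_{\mathrm{CS}} = 1 - \left( \frac{L_0}{L_M} \right)^{2/n},
\qquad
R^2_{\mathrm{N}} = \frac{R^2_{\mathrm{CS}}}{1 - L_0^{\,2/n}}
```

Under this definition, the variance attributed to a set of markers is presumably the change in R2 when those markers are added to a model that already contains the others, which is one way to read the per-set percentages reported above.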
The percentage of variance (R2) explained by predictors in the regression model
Figure 3 Distribution of the estimated cumulative genetic risk (P-Hat) of case and control groups using the 12-gene set and 350-gene set in the validation dataset. P-Hat is the estimated cumulative genetic risk (the probability of being an MS case).
Several factors could have contributed to the relatively low sensitivity of the selected genes. First, the power of the discovery dataset is most likely inadequate to detect all susceptibility genes. Even though we used the largest MS genetic dataset available to date, it has been suggested that a dataset with 10,000 cases and 10,000 controls might be needed to reach a desirable level of power for GWAS analysis and effectively control both type I and type II errors. This is especially true for less frequent alleles (minor allele frequency ≤10%) and effect sizes (odds ratios) in the range 1.1 to 1.3 [35]. Second, relevant MS variants may have gone undetected because of the partial genome coverage of the currently available SNP arrays. Third, there are unknown interactions between genes involved in the biochemical pathways that contribute to MS susceptibility. Fourth, the total adjusted R2 of the logistic regression model is 0.75, and the R2 attributable to genetic factors in this model is only 56.5%, suggesting that without fitting environmental triggers into the model, predictive accuracy will remain limited. A large number of environmental exposures have been investigated in MS, but recent epidemiologic and laboratory studies have provided support primarily for vitamin D and Epstein-Barr virus exposure [37]. A recent study suggests that adding environmental risk factors to a predictive algorithm based on genetic variants enhances the classification of case-control status [39]. Fifth, because of the suboptimal power of the discovery dataset, the selected 350 variants likely include both true and false signals. False positives that help fit the discovery dataset do not contribute to prediction in the validation process, which also causes a detectable drop in classification accuracy. Thus, the results shown in Table may partly overestimate model fit in the discovery dataset, indicating that bias can be embedded in predictive modeling when association tests are used for marker selection.
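The fifth point, that markers selected for their fit to the discovery data inflate apparent accuracy there but not in independent data, can be illustrated with a toy simulation. This is a hedged sketch under stated assumptions, not a reanalysis of the study data: all markers are simulated as pure noise, selection uses a simple case-control difference in allele frequency rather than stepwise logistic regression, and every name and parameter value is hypothetical.

```python
import random


def auc(scores, labels):
    """Fraction of case-control pairs ranked correctly (ties count 0.5)."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if c > k else 0.5 if c == k else 0.0
               for c in cases for k in controls)
    return wins / (len(cases) * len(controls))


def make_dataset(rng, n_per_group, n_markers):
    """Random 0/1 genotypes that are unrelated to case-control status."""
    labels = [1] * n_per_group + [0] * n_per_group
    genotypes = [[rng.randint(0, 1) for _ in range(n_markers)] for _ in labels]
    return genotypes, labels


def mean_difference(genotypes, labels, j):
    """Case minus control frequency of marker j."""
    cases = [g[j] for g, y in zip(genotypes, labels) if y == 1]
    controls = [g[j] for g, y in zip(genotypes, labels) if y == 0]
    return sum(cases) / len(cases) - sum(controls) / len(controls)


def simulate(seed=0, n_per_group=100, n_markers=300, n_selected=20):
    rng = random.Random(seed)
    disc_x, disc_y = make_dataset(rng, n_per_group, n_markers)
    val_x, val_y = make_dataset(rng, n_per_group, n_markers)

    # Select the markers that look most associated in the discovery data only.
    diffs = [mean_difference(disc_x, disc_y, j) for j in range(n_markers)]
    chosen = sorted(range(n_markers), key=lambda j: abs(diffs[j]),
                    reverse=True)[:n_selected]
    signs = {j: 1 if diffs[j] > 0 else -1 for j in chosen}

    def score(row):
        # Signed count of "risk" alleles across the selected markers.
        return sum(signs[j] * row[j] for j in chosen)

    discovery_auc = auc([score(r) for r in disc_x], disc_y)
    validation_auc = auc([score(r) for r in val_x], val_y)
    return discovery_auc, validation_auc
```

Calling `simulate()` returns the discovery and validation AUC for the same selected markers. Because every marker here is noise, the discovery AUC reflects selection bias alone, while the validation AUC should stay near chance. In real data, true signals would carry over to validation and only the false-positive share would be lost.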
All these confounders are reflected in the fact that some individuals in the control group carry a high cumulative genetic risk (P-Hat >0.8). Thus, in this experiment using the most up-to-date MS genetic dataset, a high cumulative genetic risk is not sufficient to predict affection status with high confidence, even in the discovery dataset (Table ). Additional layers of complexity are represented by the likelihood of unaccounted-for epistatic interactions, etiological heterogeneity, and epigenetic and random events. These limitations notwithstanding, the genetic risk as assessed here still captures a significant portion of the difference in cumulative genetic risk (the probability of being an MS case) between the case group (median = 0.59, 75th percentile = 0.74) and the control group (median = 0.32, 75th percentile = 0.49) in the validation dataset. The model with the 350-gene set produced a larger difference in estimated cumulative genetic risk between the case and control groups than the model with the 12-gene set (Figure 3). Thus, the cumulative genetic risk (P-Hat) generated using the 350-gene set can still provide a useful index of the genetic load associated with MS, and provides important mechanistic insights.
Most validated MS susceptibility loci have well-defined roles in immunologic functions, consistent with the hypothesis that MS etiology has its primary roots in early immune system dysregulation, precipitating secondary neuronal degeneration. On the other hand, a network-based pathway analysis of two GWAS in MS, where evidence for genetic association was combined with evidence for protein-protein interaction, demonstrated the role of neural pathway genes (axon guidance and long-term potentiation) in conferring susceptibility [26]. The genetic profile identified in this analysis confirms the significant enrichment of genes involved not only in the immune response but also in nervous system development and neuronal signaling (Table ). These included genes encoding cell-cell adhesion molecules (CDH2, and NRXN3) and several neuronal receptors, such as the G-protein coupled receptors (ADRA1A, and HTR2A), as well as the metabotropic glutamate receptor (GRM8) and ionotropic glutamate receptors (GRIK4). Interestingly, members of the glutamate receptor pathway have been previously identified by our group in both the network-based study of one of the GWAS datasets included in the meta-analysis utilized here (GRIN2A) and an independent pharmacogenomic study of type I interferon response (GRIA1). A more recent pharmacogenomic study also identified the ionotropic glutamate receptor (GRIA3) associated with interferon response in MS [41]. These observations further support the proposed mechanism of glutamate excitotoxicity as a precipitating agent of the glial and axonal injury observed in MS [42]. The ramifications of these SNPs on expression or function are unknown; however, their recent and continued identification may help evolve a model of MS pathogenesis with increasing contributions from neuronal genes.
In summary, the cumulative genetic risk estimation using a genetic profile composed of 350 genes provides a useful index of the genetic risk leading to MS. The incomplete classification accuracy most likely reflects the limited power of available genetic datasets and the difficulty of incorporating gene-gene and gene-environment interactions. The imminent publication of larger high-resolution GWAS and transcriptomic studies, together with recent progress in identifying true environmental variables, will refine this and other modeling approaches for a greater understanding of MS genetics and the assessment of translational applications.