Building on our previously developed dense module search (DMS) method, we proposed an alternative, restricted search strategy in this work and demonstrated it in the CATIE GWAS dataset, one of the major available GWAS datasets for schizophrenia. We also explored different options for defining gene-wise P values: the VEGAS-all method, which builds on all the SNPs mapped to a gene; the VEGAS-top method, which uses the top 10% of SNPs mapped to a gene; and the minP method, which uses the most significant SNP. By applying our restricted search strategy to each of the three resulting sets of gene-wise P values, we showed that the VEGAS-all method generated the smallest number of module genes and was least affected by potential confounding factors such as gene length. The other two methods resulted in similar numbers of module genes. These results call for caution when selecting a method to compute gene-wise P values, because the choice may substantially influence the module genes prioritized for the disease.
The restricted search strategy is intended to reduce the overlap among modules. Assume that a local environment of the background network includes 5 nodes, namely A, B, C, D, and E. Starting from node A, a module including A, B, C, and D would be generated, with each expansion step satisfying $Z_{m+1} > Z_m \times (1 + r)$. Starting from B, a module including B, C, D, and E would be generated. In our previous strategy for applying DMS, both modules would be reported, even though they share 75% of their genes. In the current strategy, to resolve the issue of overlap, we start the module search from the node that has the highest weight, e.g., A. Once the module has been built, we remove its genes from the background network, e.g., the nodes A, B, C, and D would be removed from the network and, thus, from further analysis. In this way, the module starting from B would not be reported, as most nodes in it have already been removed from consideration. This ensures that each node in the network is analyzed only once and is involved in at most one module.
Both strategies have their own advantages. The traditional one performs a comprehensive search and gives every node in the network the chance of being a seed. However, its computational intensity is high and the redundancy among modules is strong. Furthermore, the correlation among modules poses challenges for the follow-up statistical tests used to select modules. In contrast, the restricted strategy is computationally efficient because it gradually shrinks the background network, and it guarantees that modules do not share nodes. However, it may miss moderately significant genes that cannot be included in any module. In practice, either of the two strategies can be selected depending on the specific aims and project design.
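The restricted search described above can be sketched roughly as follows. This is a minimal illustration, not the DMS implementation: the function names, the toy path network A-B-C-D-E, the node weights, the value r = 0.05, and the choice to expand only through direct neighbours and to report only modules with at least two nodes are all assumptions made for the example. The module score is assumed to be the sum of node z-scores divided by the square root of the module size, and node weights are assumed to be positive.

```python
import math

def module_score(nodes, z):
    """Assumed module score: Z_m = sum(z_i) / sqrt(number of nodes)."""
    return sum(z[n] for n in nodes) / math.sqrt(len(nodes))

def grow_module(seed, adj, z, r):
    """Greedily expand from `seed` while Z_{m+1} > Z_m * (1 + r)."""
    module = {seed}
    score = module_score(module, z)
    while True:
        # Candidates are direct neighbours of the current module (assumption).
        candidates = set().union(*(adj[n] for n in module)) - module
        if not candidates:
            break
        # sorted() makes tie-breaking deterministic.
        best = max(sorted(candidates),
                   key=lambda n: module_score(module | {n}, z))
        new_score = module_score(module | {best}, z)
        if new_score <= score * (1 + r):
            break
        module.add(best)
        score = new_score
    return module, score

def restricted_search(adj, z, r=0.1, min_size=2):
    """Seed from the highest-weight remaining node, then remove the module."""
    remaining = {n: set(nb) for n, nb in adj.items()}  # working copy
    modules = []
    while remaining:
        seed = max(sorted(remaining), key=lambda n: z[n])
        module, score = grow_module(seed, remaining, z, r)
        if len(module) >= min_size:
            modules.append((sorted(module), score))
        # Shrink the background network so no node is reused.
        for n in module:
            del remaining[n]
        for neighbours in remaining.values():
            neighbours.difference_update(module)
    return modules

# Hypothetical toy network: a path A - B - C - D - E with made-up weights.
adj = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"},
       "D": {"C", "E"}, "E": {"D"}}
z = {"A": 3.0, "B": 2.5, "C": 2.0, "D": 1.8, "E": 0.2}

modules = restricted_search(adj, z, r=0.05)
for genes, score in modules:
    print(genes, round(score, 3))
```

With these made-up weights, the search seeded at A absorbs B, C, and D and stops before E. Once A through D are removed, E has no neighbours left and is not reported in any module, which mirrors the trade-off noted above.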
Computation of gene-wise P values is one of the key steps in most post-GWAS analyses, and several methods and tools have been published for this purpose. The most widely applied method in the field is to select the SNP with the smallest P value among all SNPs mapped to a gene, although this method is subject to several known biases, such as gene length, SNP density, and the local LD structure. We selected VEGAS because of its advantages, such as an acceptable computation time (<12 hours for a typical GWAS dataset such as ours) and no need for genotyping data. The rationale for including two formulations of VEGAS is that using all SNPs mapped to a gene (i.e., the VEGAS-all method) is comprehensive but may dilute the signal, while using only part of the SNPs (i.e., the VEGAS-top method) captures the most significant 10% of SNPs but may miss some informative ones.
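To make the gene-length bias of the minP approach concrete, the sketch below computes minP for a gene, selects the SNP subset that a top-10% formulation would start from, and evaluates how often minP falls below a threshold by chance. It is only an illustration under stated assumptions: the function names are hypothetical, the subset size is rounded to the nearest integer (the rounding rule used by VEGAS is not stated here), and the null probability assumes independent SNPs, i.e., it ignores LD. The LD-aware simulation that VEGAS uses to obtain its gene-wise P value is not shown.

```python
def min_p(snp_pvalues):
    """minP: the smallest SNP P value mapped to the gene."""
    return min(snp_pvalues)

def top_fraction(snp_pvalues, fraction=0.10):
    """Most significant `fraction` of SNPs, at least one (rounding is assumed)."""
    k = max(1, round(fraction * len(snp_pvalues)))
    return sorted(snp_pvalues)[:k]

def null_prob_min_p_below(p, n_snps):
    """P(minP <= p) under the null, assuming n_snps independent SNPs."""
    return 1.0 - (1.0 - p) ** n_snps

# Hypothetical SNP P values for one gene with 20 mapped SNPs.
pvalues = [0.50, 0.03, 0.80, 0.20, 0.004, 0.65, 0.90, 0.11, 0.33, 0.72,
           0.41, 0.58, 0.07, 0.95, 0.26, 0.15, 0.88, 0.47, 0.61, 0.09]

print(min_p(pvalues))          # most significant SNP
print(top_fraction(pvalues))   # the two most significant of 20 SNPs

# The same threshold is crossed by chance far more often in a longer gene.
print(null_prob_min_p_below(0.01, 10))    # about 0.096
print(null_prob_min_p_below(0.01, 100))   # about 0.634
```

Under the independence assumption, a gene with 100 SNPs reaches minP <= 0.01 by chance in roughly 63% of cases, against roughly 10% for a gene with 10 SNPs. This is why an uncorrected minP tends to favour long or densely genotyped genes.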
However, VEGAS computes a SNP-SNP matrix based on pairwise LD values and can only deal with autosomal SNPs. SNPs located on the sex chromosomes (X and Y) cannot be handled by VEGAS, and the corresponding genes were removed from our network-based analysis. Although these genes accounted for only a small proportion (3.9%) of the PINA network we used, more comprehensive algorithms that are able to handle all genes in the genome are needed in future work.
The module genes we identified, under every scenario, included neuro-related and/or immune-related genes and pathways. All three sets of module genes include well-studied candidate genes for schizophrenia (e.g., DTNBP1), glutamate receptors (e.g., GRIN1), several genes located in the MHC region (e.g., HIST1H1A), and genes from the 14-3-3 protein family (e.g., YWHAQ). Interestingly, all three module gene sets contain several genes in the MHC region, even though none of these genes passed the single-marker significance threshold of $5 \times 10^{-8}$. The MHC region has been shown to harbor significant association signals in a combined analysis of three GWAS datasets for schizophrenia [11]. The identification of these genes by our DMS method further supports this signal. It also illustrates that network-based analysis can reveal markers that individually fail the single-marker test but whose joint effects on the disease may be significant.