In this work we present two ways to integrate the information content of multiple independent experiments for classification purposes. One approach is to construct multinet Bayesian networks, in which a network is constructed on each experiment and the results are combined in the assignment step; the other is to construct a single network on the union of samples from all experiments. While both methods can be used to integrate the information from multiple experiments, we found that multinets generally outperform single-structure networks.
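The contrast between the two integration strategies can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from this work: a naive Bayes classifier over binary features stands in for a learned Bayesian network, every experiment shares the same class labels and features, and the per-experiment results are combined in the assignment step by summing log posteriors. All function names are hypothetical.

```python
import math
from collections import Counter


def fit_naive_bayes(samples, labels, alpha=1.0):
    """Fit a naive Bayes model on binary features (a stand-in for a
    learned Bayesian network; alpha is a Laplace smoothing constant)."""
    classes = sorted(set(labels))
    n_features = len(samples[0])
    class_counts = Counter(labels)
    # ones[c][j] = number of class-c samples whose feature j equals 1
    ones = {c: [0] * n_features for c in classes}
    for x, y in zip(samples, labels):
        for j, v in enumerate(x):
            ones[y][j] += v
    return {
        "classes": classes,
        "log_prior": {c: math.log(class_counts[c] / len(labels)) for c in classes},
        "p_one": {
            c: [(ones[c][j] + alpha) / (class_counts[c] + 2 * alpha)
                for j in range(n_features)]
            for c in classes
        },
    }


def log_posterior(model, x):
    """Return normalized log P(class | x) for one sample."""
    scores = {}
    for c in model["classes"]:
        s = model["log_prior"][c]
        for v, p in zip(x, model["p_one"][c]):
            s += math.log(p if v else 1.0 - p)
        scores[c] = s
    top = max(scores.values())
    log_z = top + math.log(sum(math.exp(s - top) for s in scores.values()))
    return {c: s - log_z for c, s in scores.items()}


def predict_single(model, x):
    """Single-network strategy: one model fitted on the pooled samples."""
    lp = log_posterior(model, x)
    return max(lp, key=lp.get)


def predict_multinet(models, x):
    """Multinet strategy: one model per experiment, combined at assignment
    time by summing log posteriors (an assumed combination rule)."""
    total = {c: 0.0 for c in models[0]["classes"]}
    for m in models:
        for c, value in log_posterior(m, x).items():
            total[c] += value
    return max(total, key=total.get)


# Toy data: two hypothetical experiments with two binary features each.
exp_a = ([[1, 0], [1, 1], [0, 0], [0, 1]], ["case", "case", "control", "control"])
exp_b = ([[1, 1], [1, 0], [0, 1], [0, 0]], ["case", "case", "control", "control"])

multinet = [fit_naive_bayes(*exp_a), fit_naive_bayes(*exp_b)]
pooled = fit_naive_bayes(exp_a[0] + exp_b[0], exp_a[1] + exp_b[1])
```

On this toy data both strategies agree; differences would be expected only when the experiments differ in their feature distributions, which pooling averages away and per-experiment models preserve.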
Furthermore, we found that the most prominent genes were often not among the very top proportion of differentially expressed genes for each individual GEO experiment, but rather toward the middle of the spectrum. Bias within an individual experiment can be one factor that results in spurious differentially expressed genes for that experiment. In this regard, if the conditions of all experiments were the same, the list of differentially expressed genes common to the multiple experiments under consideration would be more robust to false positives. However, the application of our framework goes beyond multiple experiments carried out under exactly the same conditions or for exactly the same phenotype. To obtain a general view of the phenotype, we can consider experiments on different subtypes, outcomes, conditions, and even tissues in a predictive setting. The implicated genes are therefore those that are present in all considered aspects of the phenotype. These genes can be viewed as contributing to the phenotype regardless of bias, subtype, and condition.
For instance, none of the 21 intersecting genes among the top 35% of differentially expressed genes across all twelve GEO experiments relating to Obesity were present in the gold standard Obesity Gene Map list [21]; however, among the 180 intersecting genes found when a larger proportion of the genes in each experiment was considered, six were also present on the gold standard list. Such a finding indicates that certain genes may have been overlooked in the past when studying these diseases; further research may focus on the effect and prominence of these seemingly “less” differentially expressed genes.
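The effect of widening the per-experiment cutoff can be sketched as follows. This is an illustration under stated assumptions, not the pipeline used here: each experiment is represented as a hypothetical mapping from gene name to a differential-expression score (higher means more differentially expressed), and the gene names, scores, and gold standard set are invented toy values.

```python
def top_fraction(scores, fraction):
    """Return the genes ranked in the top `fraction` by descending score."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return set(ranked[:keep])


def common_genes(experiments, fraction):
    """Intersect the top-`fraction` gene sets of all experiments."""
    gene_sets = [top_fraction(scores, fraction) for scores in experiments]
    return set.intersection(*gene_sets)


# Toy scores for two hypothetical experiments over the same four genes.
experiments = [
    {"A": 5.0, "B": 4.0, "C": 3.0, "D": 1.0},
    {"C": 6.0, "D": 5.0, "A": 2.0, "B": 1.0},
]
gold_standard = {"C"}  # hypothetical curated gene list

narrow = common_genes(experiments, 0.5)   # strict cutoff: no shared genes
wide = common_genes(experiments, 0.75)    # wider cutoff: shared genes appear
overlap = wide & gold_standard            # shared genes also on the curated list
```

In this toy case the strict cutoff yields an empty intersection, while the wider cutoff recovers genes that are only moderately ranked in each experiment but consistently present across them.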
With respect to the genetic factors related to Huntington’s Disease, the automated pipeline found several genes that have already been studied in the context of the disease (RALA, CBX5, CALM3, GLG1, GLUL, MAPK8IP1, IMMT, MAP3K8, CDKN1A) [22–28]; however, it also identified genes that have not been researched with regard to Huntington’s Disease. Some of these genes (SCT, LMNB1, IVD) [29–32] have been shown to be related to certain neurodegenerative diseases or to be involved in pathways relevant to the development of Huntington’s Disease. Although this study did not examine such genes in great detail, these findings demonstrate the possibility that this pipeline may be able to propose novel candidates for disease-related genes.
A diagram of the network of interactions among genes related to Huntington’s Disease is shown in the accompanying figure. For all four diseases, the AUROC values for the models constructed using a single network structure were much lower than those for the multinet models built from the multiple experiments. These numbers show that the integrative approach using multinets provides much better predictive models than does the single network structure.
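For readers unfamiliar with the metric, AUROC can be computed directly as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below uses this pairwise definition; the function name and the toy labels and scores are illustrative assumptions and are not results from this study.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positive and
    negative samples (labels are 1 for positive, 0 for negative)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


# Toy example: three of the four positive/negative pairs are ordered correctly.
toy_auroc = auroc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is the scale on which the two model types are compared.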
The model created to represent genetic factors relating to leukemia produced similar findings: several genes in this model have already been studied to varying degrees with respect to leukemia (WT1, PDE4DIP, NCAM1, AKAP13, SLC35E1, HFE, JUN) [33–42]; however, a few novel genes also appeared in the model (IVD, SYNJ2, TTLL3). Again, such findings demonstrate the power of this project’s methods in the discovery of novel disease-specific genes.