Protein complexes and pathways are accountable for most processes in the cell. Accordingly, we can gauge the response of a cell system to a certain perturbation (such as disease) by measuring changes in the expression levels of the various functional modules of the system. To this end, we first generated a catalog of 4,620 functional modules by querying the large-scale human protein interaction network (see Methods). We then collected the mRNA expression arrays associated with each disease from the Gene Expression Omnibus (GEO). After several rounds of filtering the gene expression data for accuracy, reliability, and experimental context, we had microarrays representing 54 human diseases (see Methods and Table S1). Next, we combined the gene expression data and the 4,620 functional modules to generate a Module Response Score (MRS) for each module in each disease-state, representing its activity level (see Methods). Specifically, positive MRS values correspond to modules that are up-regulated and negative MRS values to modules that are down-regulated in the disease-state as compared to the (healthy) control. The accompanying figure gives an overview of the process used to compute the MRS values for a given disease. In the end, we generated a matrix containing the MRS values for each module in each of the 54 diseases considered in this study. The relationships between different diseases were then ascertained by the Partial Spearman correlation coefficient of their MRS values (see Methods and Figure S1). Specifically, we calculated the Spearman correlation between two diseases conditioned on the responses of the functional modules in their respective control samples. Using the Partial Spearman correlation coefficient instead of the generic Spearman correlation coefficient not only provided a quantitative metric to assess disease similarity but also explicitly factored out possible dependencies between different gene-expression experiments due to their underlying tissue or cell types.
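To make the conditioning step concrete, the following is a minimal sketch of a partial Spearman correlation, assuming that each disease and each control set is summarized as one MRS-like vector over the same modules. The exact procedure is the one described in Methods; the function names, the recursive partial-correlation formula, and the numeric vectors below are illustrative assumptions and not the study's code or data.

```python
from math import sqrt


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            result[order[position]] = average_rank
        start = end + 1
    return result


def pearson(a, b):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def partial_corr(x, y, controls):
    """Correlation of x and y after removing the linear effect of each control
    vector, using the standard recursive partial-correlation formula."""
    if not controls:
        return pearson(x, y)
    z, rest = controls[0], controls[1:]
    r_xy = partial_corr(x, y, rest)
    r_xz = partial_corr(x, z, rest)
    r_yz = partial_corr(y, z, rest)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


def partial_spearman(x, y, controls):
    """Partial Spearman correlation: partial correlation computed on ranks."""
    return partial_corr(ranks(x), ranks(y), [ranks(c) for c in controls])


# Made-up vectors over six hypothetical modules (illustrative values only).
disease_a = [1.8, -0.4, 2.1, -1.5, 0.3, 0.9]
disease_b = [1.2, -0.9, 1.7, -0.8, 0.1, 1.4]
control_a = [0.2, 0.1, 0.5, -0.3, 0.0, 0.4]
control_b = [0.1, -0.2, 0.6, -0.1, 0.2, 0.3]

print(partial_spearman(disease_a, disease_b, [control_a, control_b]))
```

With an empty list of controls the function reduces to the ordinary Spearman correlation, which is a convenient way to see how much of a disease-disease correlation is attributable to the control profiles.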
Overview of the process to generate the module response scores for each disease.
The accompanying figure shows the hierarchical clustering of diseases based on the correlations generated above. To assign significance to these associations, we randomized the gene-to-module assignments as well as the control and disease labels 100 times to generate a background distribution of disease correlations (see Methods). We then selected only those disease correlations that passed the p-value threshold of 0.01 (FDR 10.37%), resulting in 138 significant disease-disease similarity relationships. We immediately see that many expected disease associations are recovered; for example, brain disorders such as Alzheimer's disease, Bipolar disorder and Schizophrenia are pooled together in one sub-branch. We also see many novel, hitherto unknown significant correlations, such as the similarity between uterine leiomyoma and lung cancer. We also created a network representation to display all 138 significant disease correlations. In this network, the nodes are diseases, while the thickness of the edge between two diseases represents the strength of their correlation. This abstraction allows us to pick out additional significant disease associations that were missing in the hierarchical clustering. For example, Crohn's disease and Malaria share a significant disease correlation. A listing of all the significant disease correlations is provided in Table S2.
Significant disease-disease similarities.
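The randomization test and the reported FDR can be illustrated with a short sketch. The shuffling helper and the empirical p-value below are generic constructions and only an assumption about how the background distribution is used (the actual procedure is in Methods). The FDR line uses a simple estimate, expected false positives divided by the number of selected pairs, which happens to reproduce the reported 10.37% from the 54 diseases and 138 selected pairs; the text does not state that this is the formula used.

```python
import random
from math import comb


def shuffle_gene_assignments(modules, rng):
    """Randomly reassign genes to modules while keeping each module's size.
    `modules` maps a module name to its list of genes (hypothetical layout)."""
    genes = [gene for members in modules.values() for gene in members]
    rng.shuffle(genes)
    shuffled, start = {}, 0
    for name, members in modules.items():
        shuffled[name] = genes[start:start + len(members)]
        start += len(members)
    return shuffled


def empirical_p_value(observed, null_values):
    """One-sided empirical p-value: share of background values >= observed.
    The +1 terms avoid reporting an impossible p-value of exactly zero."""
    hits = sum(1 for value in null_values if value >= observed)
    return (hits + 1) / (len(null_values) + 1)


def estimated_fdr(p_threshold, n_tests, n_selected):
    """Expected number of false positives at the threshold over the number selected."""
    return p_threshold * n_tests / n_selected


n_pairs = comb(54, 2)  # all unordered disease pairs among 54 diseases
fdr = estimated_fdr(0.01, n_pairs, 138)
print(f"{n_pairs} pairs, estimated FDR = {fdr:.2%}")  # 1431 pairs, estimated FDR = 10.37%
```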
Although the 54 diseases considered in this study cover many categories of diseases ranging from cancers to cardiomyopathies, some categories such as cancer are over-represented relative to others such as infectious diseases. Ideally, we would like to explicitly correct for this bias by down-weighting over-represented classes. However, the principles behind organizing diseases into categories such as cancers, infectious diseases and others are not the same. For instance, diseases are classified as cancers if their underlying pathology consists of a group of cells that show uncontrolled growth, invasion of nearby cells and metastasis. On the other hand, infectious diseases are diseases that are caused by pathogens and have the potential to spread from person to person. Lack of a common organization scheme prevents us from explicitly correcting for the observed over-representation. Moreover, there is considerable heterogeneity even among diseases of the same category. For instance, the category of cancers covers a wide variety of diseases affecting many different cell types and having many different biological causes ranging from mutations caused by chemical carcinogens to bacterial and viral infection. This heterogeneity is seen even at the transcriptional level. We have also observed this heterogeneity in the results of our study, as the 17 cancers considered in our analysis did not all cluster together. By combining both mRNA expression and protein interaction data, we provide one of the first ways to compare and classify diseases systematically. The common organizing principle here is the molecular pathology of a given disease.
At the outset, we explored the genetic basis of the diseases in our study to explain and validate the observed disease correlations. Specifically, we aimed to test the hypothesis that diseases which are significantly associated through the MRS-based correlation coefficient also significantly share disease genes. For this purpose, we collected a list of genes known to be associated with diseases, hereinafter referred to as the Disease Gene List (see Methods). We found known gene variants associated with only 31 of the 54 diseases in our study, resulting in a total of 465 possible pair-wise disease comparisons. A pair of diseases was considered to significantly share disease genes only if the Hypergeometric p-value of the overlap was less than 0.01. Eighty-two of the 465 comparisons significantly shared disease genes. On the other hand, only 73 of the 465 disease pairs were significantly associated using the MRS-based correlation coefficient. This gives rise to the accompanying contingency table, with a one-sided Fisher's Exact Test p-value of 0.033. This suggests that the genetic similarity between diseases contributes significantly to the molecular pathological disease similarity observed in this study. The lack of a stronger p-value might be explained by the fact that the number of known disease genes is much higher for well-studied diseases like Schizophrenia (345 genes) than for less well-studied diseases like Mixed hyperlipidemia (4 genes). Mapping of genes to diseases was also hindered by the fact that we used a very strict vocabulary to define diseases (see Methods). Finally, this result might also allude to the role of environment in disease causation and similarity. A few of the significant disease correlations that also significantly shared disease genes are listed in the accompanying table on genetic similarity, and the complete list is provided in Table S3.
Contingency table to evaluate the hypothesis that significant disease correlations also significantly shared disease genes.
Genetic similarity between significant disease correlations.
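As an illustration of the two tests used here, the sketch below computes a hypergeometric tail probability and builds a one-sided Fisher's exact test on top of it. The margins (465 pairs, 82 gene-sharing, 73 MRS-significant) come from the text; the overlap of 20 pairs is a hypothetical placeholder chosen only to make the example run, so the printed p-value is not the study's result. Function and argument names are likewise assumptions.

```python
from math import comb


def hypergeom_tail(k, population, successes, draws):
    """P(X >= k) when `draws` items are sampled without replacement from a
    population that contains `successes` marked items."""
    total = comb(population, draws)
    upper = min(successes, draws)
    favourable = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(max(k, 0), upper + 1)
    )
    return favourable / total


def fisher_one_sided(both, only_first, only_second, neither):
    """One-sided (enrichment) Fisher's exact test for a 2x2 contingency table."""
    population = both + only_first + only_second + neither
    return hypergeom_tail(both, population, both + only_second, both + only_first)


n_pairs = comb(31, 2)   # 31 diseases with known genes give 465 pairs
gene_sharing = 82       # pairs that significantly share disease genes
mrs_significant = 73    # pairs significant by the MRS-based correlation
overlap = 20            # HYPOTHETICAL count of pairs significant by both criteria

p_value = fisher_one_sided(
    overlap,
    gene_sharing - overlap,
    mrs_significant - overlap,
    n_pairs - gene_sharing - mrs_significant + overlap,
)
print(n_pairs, p_value)
```

The same `hypergeom_tail` function could score the gene overlap between two diseases, given the size of the gene universe, which is not stated in this section.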
In order to further understand the biology behind the observed disease correlations, we examined some of their underlying functional modules. First, we analyzed in more detail the sub-branch of brain disorders in the hierarchical representation of the disease correlations: Alzheimer's disease (ALZ), Bipolar disorder (BIP), Schizophrenia (SCHZ), and Glioblastoma (GLIO). One module corresponds to the synaptic vesicle and was one of the most down-regulated modules in all four diseases (second lowest average MRS value). The synaptic vesicle is a secretory organelle that stores neurotransmitters and releases them into the synapse. Loss of synaptic function and, more specifically, decreased expression of synaptic vesicle proteins such as SNAP-25 is one of the main effects of ALZ. Decreased synaptic function has also been observed for both BIP and SCHZ. In particular, the levels of the protein SNAP-25 were shown to be reduced in both BIP and SCHZ. The function of this module in GLIO is still to be explored.
Uterine leiomyomas (UTL) are benign tumors affecting the uterus. In our analysis, UTL shares a strong correlation with lung cancers. One of the underlying modules corresponds to the DNA repair pathway, which had the highest average MRS value for the three diseases. Polymorphisms in genes involved in the DNA repair pathway, such as PCNA and POLB, have been associated with increased risk of lung cancer. Moreover, the Arg399Gln allele of the XRCC1 gene has been shown to be a risk factor for lung adenocarcinoma and lung squamous cell carcinoma. Surprisingly, the same Arg399Gln polymorphism in the XRCC1 gene has also been associated with an increased risk of UTLs, giving causal genetic evidence for the correlation we observed between the diseases using microarray-based molecular pathological measurements.
Knowledge of a comprehensive disease-similarity tree (network) based on molecular data could be used to find new uses for existing drugs. Similar diseases share similar molecular phenotypes and could potentially be treated by similar drugs. To explore this avenue, we collected a list of drugs, their corresponding target genes and the diseases they are known to treat, through either US FDA approved indications or off-label uses. This information was obtained from RxNorm from the National Library of Medicine, DrugBank, the National Drug File Reference Terminology (NDF-RT) and MicroMedex. Overall, 17 of the 138 significant disease correlations shared at least one drug in common and 14 of them had a significant Hypergeometric p-value of less than 0.01 (see the accompanying table and Table S4). For instance, we found that the FDA approved drug Fluorouracil, used to treat Actinic keratosis, has been shown to have positive indications for treating Malignant tumor of the colon. Similarly, the drug Doxorubicin is FDA approved to treat both Urothelial carcinoma and Acute myeloid leukemia. This number is a conservative estimate as the list of drugs used here is incomplete. Moreover, we used a very specific vocabulary to define diseases (see Methods) and mapped drugs to them accordingly. For instance, we found many drugs treating lung cancer; however, in many cases our combined knowledge base does not specify whether the cancer was an adenocarcinoma or a squamous cell carcinoma. In those cases, we excluded the drug from consideration. A caveat to this approach is that drugs can be shared between diseases mainly because the diseases belong to the same category; for instance, drugs can be shared between two cancers. As a result, it is difficult to determine whether two diseases share drugs because of similarity in their molecular pathology or because of their underlying disease type. Moreover, the chemical similarity between drugs can also affect the reported p-values.
Shared drugs among significant disease correlations.
Another consequence of elucidating and quantifying the response of the cell system to a disease is that we can use this methodology to find modules that are generally dysregulated (activated or repressed) in the disease-state. In other words, we used the MRS values to characterize a common “signature” across disease-states. To generate the set of modules that are commonly dysregulated in the 54 diseases considered in this study, we used a two-step approach. First, a module was selected if the median of its absolute MRS values across all diseases was significantly higher than expected at random. We generated a random background distribution of median scores by shuffling the gene-to-module assignments (see Methods). Overall, at a p-value of 0.01 and an associated FDR of 16.15%, we selected 286 modules. Second, we filtered this set of 286 modules to include only those modules that were significantly differentially expressed in many diseases. A module was determined to be significantly differentially expressed in a given disease if the absolute value of its MRS was above 1.5 (p 0.028). Finally, we selected the 59 modules that were significantly differentially expressed in 20 or more diseases as the common disease-state signature. These modules were thus not only dysregulated in at least half of the diseases each but also significantly differentially expressed in at least 20 diseases. Moreover, these 59 modules taken together were dysregulated in 45 of the 54 diseases in our study. Figure S2
shows the combined illustration of all the 59 modules. They were mainly enriched for the functions of immune system response (p
6E-70) and DNA repair (p
4.1E-30). A representative sample of 7 modules is shown in .
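The selection of signature modules can be sketched as follows. The thresholds (|MRS| above 1.5, at least 20 diseases) and the counts (4,620 modules, 286 selected at p = 0.01) are taken from the text; the data layout, function names, and toy scores are assumptions made for illustration. The FDR line again uses the simple ratio of expected false positives to selected modules, which matches the reported 16.15% but is not stated in the text as the formula used.

```python
from statistics import median


def median_abs_mrs(scores):
    """Median of the absolute MRS values of one module across diseases."""
    return median(abs(score) for score in scores)


def signature_modules(mrs, candidates, score_cutoff=1.5, min_diseases=20):
    """Keep candidate modules whose |MRS| exceeds `score_cutoff` in at least
    `min_diseases` diseases. `mrs` maps a module name to its per-disease scores."""
    return [
        module
        for module in candidates
        if sum(1 for score in mrs[module] if abs(score) > score_cutoff) >= min_diseases
    ]


# Expected false positives (0.01 x 4,620 modules) over the 286 selected modules.
module_fdr = 0.01 * 4620 / 286
print(f"{module_fdr:.2%}")  # 16.15%

# Toy scores over four hypothetical diseases; min_diseases is lowered to fit.
toy_mrs = {
    "module_x": [2.0, -1.8, 1.6, 0.2],
    "module_y": [0.4, -0.3, 1.7, 0.1],
}
print(signature_modules(toy_mrs, ["module_x", "module_y"], min_diseases=3))  # ['module_x']
```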
We investigated the 59 modules further by searching for known drug target genes/proteins. We obtained the list of drugs and their corresponding targets from the DrugBank database. Overall, 70 genes/proteins within the 59 signature modules were identified as targets of known drugs, giving a Hypergeometric p-value of 1.8E-11. Thus, the set of signature modules was significantly enriched for drug target genes compared to that expected by chance. We then predicted that other genes/proteins in these modules would also serve as prime candidates for designing new drugs. Most existing drug target genes fall into a comparatively small set of gene families such as G protein coupled receptors and serine proteases. Hence, new drug targets can be found by exploring other members of the protein families of existing drug targets. We searched the 59 signature modules for genes that belonged to the same protein families as known drug target genes. For that purpose, we obtained a list of genes and their corresponding families and sub-families from the PANTHER database. Overall, we found that 241 of the 450 genes in the signature modules shared protein families with known drug target genes, compared to a total of only 3,520 such genes in the whole human PPI network, giving a Hypergeometric p-value of 1.47E-12. Therefore, the 59 signature modules were also significantly enriched for druggable genes. Further, we counted the number of distinct diseases known to be treated by the drugs corresponding to each of the 70 known drug targets. We observed that drugs targeting these 70 genes are known to treat an average of 65 diseases each, compared to an average of ~42 diseases for all known drug targets (p 0.02). These results provide evidence that the genes in the signature modules are more likely to be good drug targets and that drugs targeting these proteins are more likely to treat many diseases. Yildirim et al. showed that most drugs seemed to be palliative, treating the symptoms of diseases rather than the diseases themselves. Therefore, the enrichment for drug target genes which treat many diseases might be due to the shared symptoms of the diseases.
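The family-based search for candidate drug targets amounts to a small set operation, sketched below under stated assumptions: the gene identifiers and family labels are invented placeholders, the mapping stands in for a gene-to-family table such as the one obtained from PANTHER, and whether known targets themselves are counted among the "druggable" genes is not specified in the text.

```python
def same_family_genes(module_genes, gene_family, known_targets):
    """Return the module genes whose protein family contains at least one
    known drug target. Known targets are included in the result here."""
    target_families = {
        gene_family[gene] for gene in known_targets if gene in gene_family
    }
    return {gene for gene in module_genes if gene_family.get(gene) in target_families}


# Hypothetical gene-to-family mapping and target list (placeholders only).
gene_family = {
    "GENE_1": "G protein coupled receptor",
    "GENE_2": "G protein coupled receptor",
    "GENE_3": "serine protease",
    "GENE_4": "other family",
}
known_targets = {"GENE_1"}
module_genes = {"GENE_1", "GENE_2", "GENE_4", "GENE_5"}

druggable = same_family_genes(module_genes, gene_family, known_targets)
new_candidates = druggable - known_targets  # genes not yet targeted by a known drug
print(sorted(druggable), sorted(new_candidates))
```

Genes without a family annotation (here `GENE_5`) are simply left out, which is one reasonable convention among several.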
In summary, this study demonstrates the value of an integrated approach in revealing disease relationships and the resultant opportunities for therapeutic applications. Looking forward, we aim to incorporate more gene expression data from GEO and other similar repositories, and expand the set of diseases in our disease-similarity network.