The study of protein-protein interactions is essential to define the molecular networks that help maintain homeostasis of an organism’s body functions. Disruptions in protein interaction networks have been shown to result in diseases in both humans and animals. Monogenic diseases that disrupt biochemical pathways, such as hereditary coagulopathies (e.g. hemophilia), have provided deep insight into the biochemical pathways of acquired coagulopathies in complex diseases. Indeed, a variety of complex liver diseases can lead to decreased synthesis of the same set of coagulation factors as in hemophilia. Similarly, more complex diseases such as different cancers have been shown to result from malfunctions of common protein pathways. In order to discover, in high throughput, the molecular underpinnings of poorly characterized diseases, we present a statistical method to identify shared protein interaction network(s) between diseases. Integrating (i) a protein interaction network with (ii) disease-to-protein relationships derived from mining Gene Ontology annotations and the biomedical literature with natural language understanding (PhenoGO), we identified protein-protein interactions that were associated with pairs of diseases and calculated the statistical significance of the occurrence of interactions in the protein interaction knowledgebase. Significant correlations between diseases and shared protein networks were identified and evaluated in this study, demonstrating the high precision of the approach and correct non-trivial predictions, signifying the potential for discovery. In conclusion, we demonstrate that the associations between diseases are directly correlated to their underlying protein-protein interaction networks, possibly providing insight into the underlying molecular mechanisms of phenotypes and biological processes disrupted in related diseases.
Currently, common diseases are mainly defined by their clinical appearance, with little reference to their molecular mechanism. For example, syndromes are defined in medicine as a set of phenotypes which, occurring together, serve to define a trait or disease. These phenotypes overlap in the case of many syndromes. This overlap brought about the concept of ‘syndrome families’ through consideration of the commonality of features shared between diseases. Conceptually, what we have learned from about 2000 human single gene diseases with a defined genetic phenotype is that each monogenic disease has a specified collection of specific phenotypic features. For example, hemophilias with deficiencies in coagulation factors, otherwise called hereditary coagulopathies, are single gene diseases with clear Mendelian inheritance that have provided significant insight into the biochemical pathways of acquired coagulopathies. Indeed, a variety of complex liver diseases can lead to decreased synthesis of the same set of coagulation factors as in hemophilia, leading to the same disease phenotype despite very different causes. In some cases, the clustering of syndromes into these families in combination with genetic insights has led to the discovery that what were often thought of as two different disorders were really variable expressions of the same disorder [2–4]. Conversely, it has long been known that mutations at different loci can lead to the same genetic disease. It has also been hypothesized that this genetic heterogeneity has its roots at the protein interaction level, suggesting that other genes associated with the phenotype also have some functional role. Therefore, it is plausible to theorize that phenotypic overlap of diseases may reflect, at multiple biological scales, the relationships and functional properties of shared underlying molecular networks.
As signal transduction pathways are less understood than biochemical pathways, protein-protein interaction networks provide unique opportunities for exploring the signaling pathways of diseases.
The shift in focus to systems biology has resulted in an increased interest in biological pathways and protein-protein interaction networks. As a result, large scale knowledge bases representing them are being rapidly developed [7–14]. These resources enable us to study complex biological problems using high throughput computational tools. While there is a wealth of protein-disease relationships in the published literature and a number of readily computable protein-protein interaction resources, there has been a paucity of work relating diseases using protein interactions from this kind of knowledge. Making use of these networks is a relatively new challenge in the field. Network-based analyses have been developed with a number of goals in mind, including protein function prediction, identification of functional modules, interaction prediction [18–21], and the study of network structure and evolution [22–26].
To explore the possibility of using protein-protein interaction networks to identify correlations between diseases, we hypothesize that protein-protein interactions shared by two diseases or more can be accurately identified in a protein interaction network by integrating knowledge from the literature and using statistical methods.
The method reported in this paper utilizes the PhenoGO database [www.phenoGO.org], which provides protein-GO-phenotype relations, and the human-curated Reactome knowledgebase, which provides protein interactions, to link protein-protein interactions with diseases. The recently developed PhenoGO database provides phenotypic context to protein-GO annotations; for example, lymphoid tissue (a phenotype) is linked to interleukin 2 receptor (a protein) and interleukin 2 receptor activity (a GO concept). It augments Gene Ontology (GO) annotations by extracting protein-GO-phenotype relations from the literature using MeSH terms and a natural language processing (NLP) system, BioMedLEE, combined with the PhenOS phenotype ontology organizer system. The phenotypic information, including diseases, tissues, and organs, is encoded into Unified Medical Language System (UMLS) codes as well as other ontological coding systems. PhenoGO was evaluated for anatomical and cellular context in mice, demonstrating a recall of 92% and a precision of 91%. PhenoGO has since been extended to comprise over 523,000 unique entries associating disease phenotypes, ontological concepts, and proteins. In total, PhenoGO now contains data from 8,509 distinct PubMed articles, representing 7,016 distinct proteins classified under 3,214 distinct GO concepts in 3,102 distinct diseases. From a random sample of 120 protein-disease-GO ternary annotations, precision was estimated at 85%, and recall at 76% [unpublished result].
In order to identify associations between diseases by mapping their respective protein interaction networks with statistical significance values, we took the following steps. An overview of the process is pictured in Figure 1.
The first step, the extraction of diseases and their associated proteins, was achieved through Structured Query Language querying of the PhenoGO database. We extracted all UMLS-coded diseases classified under the “Disease” semantic type hierarchy along with their associated proteins. In this study, we chose a conservative approach and extracted only diseases associated with more than 4 proteins. This avoids errors stemming from mis-assignment in PhenoGO and reduces spurious predictions from the hypergeometric distribution in the next step, because a single error has a proportionally larger statistical impact on a smaller sample of proteins (equation 1). These UMLS-coded terms fall under the UMLS semantic types ‘Congenital Abnormality’, ‘Disease or Syndrome’, ‘Experimental Model of Disease’, ‘Anatomical Abnormality’, and ‘Neoplastic Process’. The resultant set consists of 154 diseases and their 1,931 associated proteins (http://phenos.bsd.uchicago.edu/PSB2007/).
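The filtering rule in this step can be illustrated with a minimal Python sketch. The study itself used SQL queries against PhenoGO; the input format (an iterable of disease-protein pairs) and the function name `filter_diseases` are assumptions made only for illustration.

```python
from collections import defaultdict


def filter_diseases(rows, min_proteins=5):
    """Group (disease, protein) rows and keep diseases with enough proteins.

    min_proteins=5 encodes the "more than 4 proteins" rule from the text.
    The row format is hypothetical, not the actual PhenoGO schema.
    """
    proteins_by_disease = defaultdict(set)
    for disease, protein in rows:
        # A set removes repeated disease-protein annotations.
        proteins_by_disease[disease].add(protein)
    return {
        disease: proteins
        for disease, proteins in proteins_by_disease.items()
        if len(proteins) >= min_proteins
    }
```

Counting distinct proteins rather than rows keeps a disease that is annotated repeatedly with the same few proteins from passing the threshold.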
The second step was to correlate diseases with their underlying protein-protein interaction networks using a statistical approach. In this study, we used the Reactome protein interaction dataset to define the underlying topological networks associated with these diseases. The proteins common to the disease-associated proteins in PhenoGO and the proteins in Reactome were identified using UniProt identifiers. The Reactome data set defines four distinct types of reactions: 1) neighboring reactions, which define interactions that occur consecutively; 2) indirect complexes, which define interactions that involve subcomplex interaction, but not direct binding/interaction; 3) direct complexes, which define protein-protein complexes; and 4) reactions, which represent situations where the two proteins participate in the same reaction. The Reactome dataset was normalized to a set of paired Swiss-Prot accession numbers, and filtered to remove neighboring reactions and indirect complexes, leaving only entries for reactions and direct complexes. This data set contains 20,317 distinct interactions corresponding to 1,140 distinct proteins. From the 154 diseases, we generated all pairwise combinations of diseases, and for each pair of diseases, the proteins of the two diseases were also paired in all potential combinations. These protein pairs were then cross-referenced with our filtered Reactome data set to determine whether they participated in reactions or formed direct complexes with one another. Two basic types of relationships are used in the calculations in our method. These relationships correspond to the two scenarios we considered to determine whether two diseases share interaction networks: 1) an identity relationship, where common proteins are shared by two diseases, and 2) direct interactions between protein A in one disease and protein B in the other disease.
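The pairing and cross-referencing described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the function name `shared_pair_counts` is hypothetical, the number of candidate pairs is taken as the product of the two protein-set sizes (as in the Xeroderma Pigmentosum and Cockayne Syndrome example in the results), and interactions are treated as undirected.

```python
def shared_pair_counts(proteins_a, proteins_b, interactions):
    """Return (n, m) for one pair of diseases.

    n: candidate protein pairs between the two diseases (assumed |A| * |B|).
    m: proteins common to both diseases (identity relationship) plus
       distinct cross-disease pairs found among the known interactions.
    """
    a, b = set(proteins_a), set(proteins_b)
    # frozenset makes each interaction order-independent: (x, y) == (y, x).
    known = {frozenset(pair) for pair in interactions}

    common = a & b
    interacting = {
        frozenset((x, y))
        for x in a
        for y in b
        if x != y and frozenset((x, y)) in known
    }
    return len(a) * len(b), len(common) + len(interacting)


# Toy example with made-up protein identifiers.
n, m = shared_pair_counts(
    {"P1", "P2", "P3"},
    {"P3", "P4"},
    [("P1", "P4"), ("P4", "P2"), ("P1", "P2")],
)
print(n, m)  # 6 candidate pairs; 1 common protein + 2 interactions = 3
```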
As related diseases can share both types of relations, and due to the requirements of the hypergeometric distribution, we consider both in the underlying protein-protein interaction network in diseases. Based on this, we calculated the correlations between all possible pairs of diseases by applying the hypergeometric distribution function to identify significantly correlated diseases (equation 1) and adjustments for multiple a posteriori comparisons (equation 2), as shown below:
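A plausible form of equation 1, assuming the standard upper-tail hypergeometric test and the symbols N, M, n, and m defined in the following paragraph (the summation index i is introduced here only for illustration), is:

```latex
\[
p \;=\; P(X \ge m) \;=\; \sum_{i=m}^{\min(M,\,n)}
\frac{\binom{M}{i}\,\binom{N-M}{\,n-i\,}}{\binom{N}{n}}
\tag{1}
\]
```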
In equation 1, ‘N’ represents the total number of pair combinations between proteins of any two diseases in the experiment, including the possibility of sharing the same protein (an identical protein pair between two diseases); ‘M’ is the number of observed distinct pairs of interacting proteins that exist in the Reactome database for all the diseases in the experiment (direct interactions only); ‘n’ is the putative total number of pairs of proteins that could exist in a pair of diseases; and ‘m’ is the sum of the observed number of common proteins shared between two specific diseases and the number of distinct pairs of interacting proteins observed in the Reactome database for these two specific diseases (M ∩ n). This measure gives a p-value, which is then adjusted for multiple comparisons with the Dunn-Sidak method (a derivative of the Bonferroni method):
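Assuming the standard form of the Dunn-Sidak adjustment, equation 2 can be written as:

```latex
\[
p' \;=\; 1 - (1 - p)^{r}
\tag{2}
\]
```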
In equation 2, p’ and p represent the corrected and uncorrected p-values, respectively, and r represents the number of independent comparisons, which is the number of disease pairs (r=11,703) used in the study. These corrected p-values are then thresholded at p<0.05 to determine the final set of significantly correlated disease-disease relationships. Multiple diseases and genes sharing the same PubMed IDs can artificially boost the statistical significance of these disease pairs; therefore, relationships mapping to more than 2 overlapping PubMed IDs were removed to reduce this artifact. A total of 11,703 disease pairs passed the filter out of 11,780 candidates; the 77 excluded combinations had more than two PMID overlaps. An example of values used for the calculation is described in the results section.
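The calculation in equations 1 and 2 can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: it assumes the upper-tail form of the hypergeometric test, and the function names `hypergeom_upper_tail`, `dunn_sidak`, and `is_significant` are hypothetical.

```python
import math
from fractions import Fraction


def hypergeom_upper_tail(N, M, n, m):
    """Probability of at least m interacting pairs among n candidate pairs,
    when M of the N possible pairs interact (assumed upper-tail test)."""
    upper = min(M, n)
    favourable = sum(
        math.comb(M, i) * math.comb(N - M, n - i) for i in range(m, upper + 1)
    )
    # Exact integer arithmetic; convert to float only at the very end.
    return float(Fraction(favourable, math.comb(N, n)))


def dunn_sidak(p, r):
    """Dunn-Sidak adjustment p' = 1 - (1 - p)**r, evaluated with
    log1p/expm1 so that very small p-values are not rounded away."""
    if p >= 1.0:
        return 1.0
    return -math.expm1(r * math.log1p(-p))


def is_significant(N, M, n, m, r, alpha=0.05):
    """True when the corrected p-value for one disease pair is below alpha."""
    return dunn_sidak(hypergeom_upper_tail(N, M, n, m), r) < alpha
```

With r = 11,703 comparisons, an uncorrected p-value has to fall below roughly 4.4 × 10^-6 for the corrected value to stay under 0.05, which shows how strict this adjustment is.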
Two evaluations were conducted. The first one, a quantitative evaluation, was designed to control for the error rate in either assigning a protein disease relationship in PhenoGO or a protein-protein interaction in Reactome. It consisted of establishing the reliability of the predictions if we introduced noise in the integrated database network (10% more protein-protein interactions in the same set of diseases). The second one, a qualitative evaluation, consisted of carefully examining the discovered protein interactions shared by two diseases and identifying references in the scientific literature that validate the phenotypic overlap and potentially the protein interactions.
In this study, we examined a subset of PhenoGO pertaining to human diseases in order to identify relationships between these diseases according to the criteria described in the methods. This filtering resulted in a set of 154 diseases and their 1,931 associated proteins. The intersection between the proteins of Reactome and those of PhenoGO further reduced the set of proteins to 286. The number of candidate proteins per disease was greatly reduced by the need to be present in the Reactome dataset, and therefore the totals are smaller than observed in the PhenoGO database alone. We lose approximately 70% of the proteins in this process due to the limited content of Reactome. In order to identify relationships between these diseases, we analyzed their underlying protein-protein interaction maps by applying a statistical method (see the Methods section for the equations). Of the 154 selected diseases, there are (285*286/2+286) = 41,041 distinct combinations of protein pairs and identical protein overlap (term N, equation 1) possible for all possible disease pairs, of which only 4,857 exist in Reactome (term M, equation 1). Figure 2 summarizes the distribution of protein-protein pairs per combination of diseases in our set. In ~60% of the 11,703 disease pairs under consideration, the number of potential protein-protein interactions is five or fewer (no significant predictions came from this category), and about 40% of them have more than five interactions. We then calculated the correlation between groups of pairs of interacting proteins associated with every pair of diseases according to equations 1 and 2 (file available at http://phenos.bsd.uchicago.edu/PSB2007/). Based on the correlations of the shared protein interacting pairs between diseases, we identified 10 pairs of diseases that are significantly correlated due to their shared proteins and protein-protein interactions out of 11,703 disease pairs examined in this study (Table 1).
We added 2031 “false positive” interactions between random nodes in the network to evaluate the robustness of the method to 10% noise in the network. We found that even with the introduced noise, none of the p-values in the top 10 entries changed. We also attempted adding 10% noise (46 “false” interactions) in just the 286 proteins under study, which changed the p-values of the top 10 entries, but left their rank order relatively intact (results available at http://phenos.bsd.uchicago.edu/PSB2007/).
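The noise-injection step can be sketched in Python as follows. This is an illustration under assumptions, not the procedure actually used: the function name `add_false_interactions`, the fixed random seed, and the choice to draw false interactions uniformly from protein pairs that do not already interact are all hypothetical.

```python
import random
from itertools import combinations


def add_false_interactions(interactions, proteins, percent=10, seed=0):
    """Return the interaction set plus `percent`% randomly chosen false pairs.

    False pairs are sampled uniformly from pairs of distinct proteins that
    are not already in the network (an assumption made for this sketch).
    """
    rng = random.Random(seed)  # fixed seed keeps the experiment repeatable
    known = {frozenset(pair) for pair in interactions}
    absent = [
        frozenset(pair)
        for pair in combinations(sorted(set(proteins)), 2)
        if frozenset(pair) not in known
    ]
    # Integer arithmetic: 10% of 20,317 interactions gives 2,031 false pairs.
    n_false = min(len(known) * percent // 100, len(absent))
    return known | set(rng.sample(absent, n_false))
```

Rerunning the full significance calculation on the returned network and comparing the ranked disease pairs with the original ranking gives the kind of robustness check described above.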
The top ranked disease pairs are shown in Table 1, all of which have a significant adjusted p-value of less than 5%. The last column of Table 1 provides scientific evidence from the literature in support of the predictions. We manually examined all the significant disease pairs and confirmed their correlations in the literature, demonstrating that our method can successfully predict non-trivial correlations between different diseases. Among these pairs of diseases, Cockayne Syndrome (CS) and Xeroderma Pigmentosum (XP) provide a very interesting example of how two diseases are correlated through their protein-protein interaction networks. Xeroderma Pigmentosum is a disorder conferring susceptibility of the skin to ultraviolet radiation, due to deficiencies in one of the XPA-XPG complementation group genes involved in nucleotide excision repair. Similarly, Cockayne Syndrome involves deficiencies in the transcription-coupled repair genes ERCC6 and ERCC8, leading to a number of conditions including abnormal sensitivity to sunlight. As shown in Figure 3, there are 27 direct protein-protein interactions and 5 common proteins (term m = 27+5, equation 1) that are shared by these two diseases. A total of 66 potential combinations of protein-protein interaction pairs (term n, equation 1) can be formed between the 11 proteins of XP and the 6 proteins of CS.
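To make the scale of this overlap concrete, the terms for this disease pair can be collected as below. The last line is shown for illustration only and is not a value reported in the study: it uses the standard mean of the hypergeometric distribution, nM/N, to estimate how many of the 66 candidate pairs would be expected to be shared by chance.

```latex
\[
\begin{aligned}
N &= \tfrac{285 \times 286}{2} + 286 = 41{,}041, \qquad M = 4{,}857,\\
n &= 11 \times 6 = 66, \qquad m = 27 + 5 = 32,\\
E[X] &= \frac{nM}{N} = \frac{66 \times 4{,}857}{41{,}041} \approx 7.8 .
\end{aligned}
\]
```

The observed value m = 32 is therefore about four times the count expected under the null model.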
As shown in Figure 3 and described in Table 2, we find that most proteins in the common networks between the two diseases are related to two DNA repair processes: Global Genomic Nucleotide Excision Repair (NER) and Transcription-coupled NER. Global Genomic NER repairs lesions in non-transcribed regions of the genome, a process independent of transcription, and Transcription-coupled NER repairs UV-induced damage in the transcribed strands of active genes. Both Cockayne Syndrome and Xeroderma Pigmentosum are associated with these processes, suggesting that defects in the repair of DNA damage are the cause of the diseases, as indicated in the literature. Our computational approach allows us to quickly identify the shared networks between these two diseases, demonstrating that the method is able to identify the underlying molecular basis shared by these diseases.
In some cases, disruptions in any of the proteins or genes lying on a pathway can lead to a disease phenotype. This is the case with both Xeroderma Pigmentosum and Cockayne Syndrome. At a higher classification level, these two diseases result from deficiencies in the DNA repair pathway, a class also shared with Li-Fraumeni Syndrome. Though these three single gene diseases have a known initial molecular cause, how this cause is related to DNA repair pathways, and whether the diseases share the same pathway or related disjoint pathways, may be poorly understood.
In another example, Fanconi’s Anemia (FA) is a hereditary DNA-repair deficiency characterized by hypersensitivity to DNA damaging agents. This disorder is caused by a mutation in any one of the genes in the Fanconi’s Anemia complementation group: FANCA, FANCB, FANCC, FANCD1, FANCD2, FANCE, FANCF, FANCG, FANCJ, FANCL, or FANCM [38–40]. Its phenotype is complex and includes anemia, several congenital malformations, and a strong predisposition to cancers [38, 39]. Kutler et al. (2003) analyzed clinical data from 754 FA patients from North America enrolled in the International Fanconi Anemia Registry, of whom 173 (23%) had a total of 199 neoplasms (28 distinct types of cancers). Among 14 potential protein interactions between Fanconi’s Anemia and Colorectal Neoplasms, 8 were found to exist in Reactome.
An evaluation of the relationship between the generality of a disease class (based on graph-theoretic distance from the “MeSH Descriptor” node in the UMLS) and the number of proteins annotated to it found no correlation (available at http://phenos.bsd.uchicago.edu/PSB2007/).
The protein-protein interaction network constructed by the Reactome dataset provides us a framework for structuring the knowledge of human diseases, which enables an objective approach to examine the molecular underpinnings of diseases in the context of their known molecular interactions on genomic scale. This method not only allows us to conduct high throughput computational analysis of the relations between diseases, but also reveals the underlying molecular relationships between diseases. Furthermore, new relationships between well-known diseases and new diseases could be revealed based on their overlapping molecular networks.
Although many diseases have been associated with their genetic and proteomic underpinnings, little research has been focused on bridging the gap between protein interactions and the relationships between diseases. Phenotype clustering methods achieve this to some extent. For example, Brunner and van Driel used a text mining approach based on MeSH terms as keywords over the OMIM database to cluster similar disease phenotypes. Our implementation of the hypergeometric distribution differs significantly from its common use in bioinformatics. Other authors have used this distribution in large scale gene expression studies to identify “over-represented” gene classes (e.g. Gene Ontology classes) and find systemic patterns. This classical implementation would be efficient in recognizing overlapping proteins or proteins sharing annotated pathways in GO, but would not recognize novel protein interactions based on newly discovered or predicted protein interactions. In contrast, we focused on protein interactions and thus counted the protein pairs rather than the genes’ assignments to categories. The proposed analytical approach could scale up in two ways. First, we could extend it to proteins interacting indirectly through a pathway rather than directly interacting in Reactome (through additional join operations in the database in order to determine those interacting with one or more intermediate proteins in pathways). In doing this, the Bonferroni-type adjustment would have to be replaced with a data-derived control for multiple comparisons, such as bootstrap or permutation resampling, in order to interpret the results. A second, probably more useful way in which this analysis can scale up is its use with the rapidly expanding number of protein-interaction databases, many of which are not publicly available.
The subset of the PhenoGO database used in this study can readily be reused in a similar manner over another protein interaction database containing more genes and provide other specialized predictions.
One question about the use of this technique is its reliability when conditions change. Since we used well established statistics and one of the most severe multiple comparison criteria for controlling for false predictions, we believe this method is robust. As this technique relies on integrating accurate protein-protein interactions with accurate gene-disease associations, and both of these datasets likely contained at least 10% false positive relationships, we conducted an evaluation adding false relationships to the network and confirmed that the identified disease pairs sharing protein networks were reliable in spite of the noise. Nonetheless, this approach remains limited by the quality of the underlying protein networks and the accuracy of protein-disease mapping. Currently, the protein-protein interaction network is still at an early developmental stage. In this study, we extracted 1,931 proteins from 154 diseases, of which only 286 proteins exist in the Reactome dataset, which contains 1,140 proteins. Therefore, the interaction network we used to correlate relationships between diseases is relatively small. As bioinformatics databases become larger and more accurate, this discovery method could become a valuable tool to identify relationships between diseases.
We intend to explore a permutation based resampling in order to unveil additional valid relationships. A resampling-based approach would help determine the optimal relationship between quantity and quality in the dataset. We also plan to significantly extend the protein-disease associations by mining additional genetic datasets. Besides using Reactome, we also demonstrated that we could use DIP, although it is smaller than Reactome [results not shown]. Since the UMLS is used to encode the diseases, we plan to compare related diseases and their associated protein-protein interactions in order to establish the molecular basis of disease relationships in ontologies.
We developed and evaluated an automatic system to predict protein interactions shared by two or more diseases. It augments current protein interaction networks by integrating literature-based knowledge of protein-disease associations and systematically identifying the statistically significant Protein Interactions of Diseases (PID). Results demonstrated that the PID system provides accurate predictions and is scalable in a number of dimensions: (i) it enables high throughput predictions, and (ii) it scales across different protein-interaction datasets. Beyond direct protein-protein interactions, it also provides the theoretical framework to compare shared pathways between diseases. In the future, this framework could be applied to more complex diseases to determine if their shared phenotypes are a result of the shared molecular mechanism and pathways.
This study was supported in part by NIH/NLM grants 1K22 LM008308-01, R01 LM007659, R01 LM008635, and the National Center for the Multi-scale Analysis of Genomic and Cellular Networks (U54CA121852-01A1).