In an effort to understand molecular mechanisms of human disease and to determine genes responsible, we systematically examine relationships between 3,949 genes, 62,663 mutations and 3,453 associated disorders within the framework of a three-dimensional structurally resolved human interactome, consisting of 4,222 high-quality binary protein-protein interactions with their atomic-resolution interfaces. We find that in-frame mutations (missense point mutations and in-frame insertions and deletions) are enriched on the interaction interfaces of proteins associated with the corresponding disorders, indicating that alteration of specific interactions by in-frame disease mutations is critical in understanding the pathogenesis of many genes. Furthermore, locations of mutations on proteins with regard to interaction interfaces are significantly associated with underlying pathogenic processes and the disease specificity for different mutations of the same gene. Based on these findings, we generate 292 new gene candidates for 694 unknown disease-to-gene associations with proposed molecular mechanism hypotheses, readily expanding our understanding of human genetic diseases and corresponding therapeutic possibilities.
Over the past few decades, a tremendous amount of resources and effort has been invested in mapping human disease loci genetically and later physically1. Since the completion of the human genome sequence, especially with advances in genome-wide association studies and ongoing cancer genome sequencing projects, an impressive list of disease-associated genes and their mutations has been produced2. However, it has rarely been possible to translate this wealth of information on individual mutations and their association with disease into biological or therapeutic insights3. Most drugs approved by the US Food and Drug Administration today are palliative4: they merely treat symptoms rather than targeting the specific genes or pathways responsible, even when the associated genes are known. One main reason for this lack of success is the complex genotype-to-phenotype relationships among diseases and their associated genes and mutations. In particular, (a) the same gene can be associated with multiple disorders (gene pleiotropy); and (b) mutations in any one of many genes can cause the same clinical disorder (locus heterogeneity). For example, mutations in TP53 are linked to 32 clinically distinguishable forms of cancer and cancer-related disorders, whereas mutations in any of at least 12 different genes can lead to long QT syndrome.
With the publication of several large-scale human protein-protein interaction networks5-8, researchers have recently begun to use complex cellular networks to explore these genotype-to-phenotype relationships2,9, on the basis that many proteins function by interacting with other proteins. However, most analyses model proteins as graph-theoretical nodes, ignoring the structural details of individual proteins and the spatial constraints of their interactions. Here, we investigate on a large scale the molecular mechanisms underlying these complex genotype-to-phenotype relationships by integrating three-dimensional (3D) atomic-level protein structure information with high-quality large-scale protein-protein interaction data. Within the framework of this structurally resolved protein interactome, we examine the relationships among human diseases and their associated genes and mutations.
We first combined 12,577 reliable literature-curated binary interactions filtered from six widely used databases10-15 (Online Methods) and 8,173 well-verified high-throughput yeast two-hybrid (Y2H) interactions5-8 to produce the high-quality human protein interaction network (hPIN) with 20,614 interactions between 7,401 proteins (Fig. 1a).
Next, we structurally resolved the interfaces of these interactions using a homology modeling approach16. We used both iPfam17 and 3did18 to identify the interfaces of two interacting proteins by mapping them to known atomic-resolution 3D structures of interactions in the Protein Data Bank (PDB)19 (Fig. 1a). Only those interactions in which the interacting domains of both partners (or their homologues) can be found in a 3D structure of an interaction are kept, resulting in the human structural interaction network (hSIN) of 4,222 structurally resolved interactions between 2,816 proteins (Fig. 1a). Here, we carefully selected high-quality direct physical interactions between human proteins because interaction databases often contain low quality and/or non-binary interactions20-22, for which interaction interfaces do not exist.
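The filtering step described above can be illustrated with a minimal sketch. It assumes that each protein already carries a set of domain annotations and that a table of domain pairs seen interacting in 3D structures is available (the kind of information iPfam or 3did supplies). The protein names, the toy domain assignments and the function `resolve_interfaces` are hypothetical and are not taken from the study's pipeline.

```python
# Hypothetical toy data: domain annotations per protein, and domain pairs
# observed interacting in 3D structures (assumed to come from a resource
# such as iPfam or 3did). None of these values are from the study.
protein_domains = {
    "A": {"SH3_1", "Pkinase"},
    "B": {"PDZ"},
    "C": {"SH2"},
}
structural_domain_pairs = {frozenset(("SH3_1", "PDZ"))}
binary_interactions = [("A", "B"), ("A", "C")]


def resolve_interfaces(interactions, domains, known_pairs):
    """Keep interactions where a domain from each partner forms a known pair.

    Returns a dict mapping each retained interaction to the list of
    (domain_on_first, domain_on_second) pairs that could mediate it.
    """
    resolved = {}
    for p1, p2 in interactions:
        hits = [
            (d1, d2)
            for d1 in sorted(domains.get(p1, ()))
            for d2 in sorted(domains.get(p2, ()))
            if frozenset((d1, d2)) in known_pairs
        ]
        if hits:  # interactions without structural support are dropped
            resolved[(p1, p2)] = hits
    return resolved


hsin_toy = resolve_interfaces(
    binary_interactions, protein_domains, structural_domain_pairs
)
```

In this toy case only the A-B interaction is retained, because A-C has no domain pair with structural evidence. Using `frozenset` keys makes the lookup independent of the order in which the two domains are listed.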
Finally, to compile a comprehensive list of disease-associated genes and their mutations, we combined information from both Online Mendelian Inheritance in Man (OMIM)23 and the Human Gene Mutation Database (HGMD)24 (Fig. 1a). In total, we were able to collect 62,663 Mendelian mutations in 3,949 protein-coding genes associated with 3,453 clinically distinct disorders (Supplementary Note 1), of which 21,716 mutations in 624 disease-associated genes were mapped to corresponding proteins in hSIN (Fig. 1a,b).
To evaluate the reliability of our homology modeling approach, we cross-validated domain-domain interactions in 1,456 interactions with co-crystal structures and found that over 94% can be correctly inferred from their homologous domains of other interacting pairs in the dataset (Supplementary Note 2). To further verify the quality of hPIN and hSIN, we investigated enrichment of highly co-expressed and functionally similar25 interacting pairs in these networks as well as unfiltered interactions relative to random pairs (Supplementary Note 3). We found that hPIN has a significantly higher enrichment for co-expressed and functionally similar pairs than unfiltered interactions (P = 0.002 and P < 10^−20 by cumulative binomial tests, respectively; Fig. 1c,d), verifying the high quality of hPIN and our filtering process. More importantly, hSIN exhibits an even higher enrichment (P < 10^−13 and P < 10^−20 by cumulative binomial tests, respectively; Fig. 1c,d), illustrating the importance of structural resolution.
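For readers unfamiliar with the test, a cumulative binomial P value is the probability of seeing at least the observed number of successes given a background rate. The sketch below shows the calculation in general form. How the study set the background rate and the counts is described in its Supplementary Notes and is not reproduced here; the numbers used are hypothetical.

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed exactly."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical counts: 30 of 200 network pairs are co-expressed,
# against an assumed 5% background rate among random pairs.
p_value = binomial_upper_tail(30, 200, 0.05)
```

For networks with many thousands of pairs, a library routine such as a binomial survival function would be numerically safer than the direct sum shown here.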
Disease mutations can be classified into two broad categories - in-frame mutations (including missense point mutations and in-frame insertions or deletions) and truncating mutations (including nonsense point mutations and frameshift insertions or deletions). Disease alleles with in-frame mutations are likely to produce full-length proteins with local defects, whereas those with truncating mutations will only give rise to incomplete fragments. Our list comprises 12,059 in-frame mutations and 9,657 truncating mutations from 624 genes in hSIN.
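The two categories can be expressed as a simple rule, sketched below. This is an illustration of the definition given above and not the annotation procedure used in the study. The function name and the `kind` labels are hypothetical, and the sketch ignores edge cases such as an in-frame insertion that happens to introduce a stop codon.

```python
def classify_mutation(kind, indel_length=0):
    """Return 'in-frame' or 'truncating' for a coding mutation.

    kind: 'missense', 'nonsense', 'insertion' or 'deletion'
    indel_length: number of inserted or deleted nucleotides (indels only)
    """
    if kind == "missense":
        return "in-frame"
    if kind == "nonsense":
        return "truncating"
    if kind in ("insertion", "deletion"):
        if indel_length <= 0:
            raise ValueError("indels need a positive length")
        # A length divisible by 3 preserves the reading frame;
        # any other length shifts it.
        return "in-frame" if indel_length % 3 == 0 else "truncating"
    raise ValueError(f"unknown mutation kind: {kind}")
```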
Although individual experiments have shown that in-frame mutations can lead to loss of interactions26, previous studies have concluded that only a small fraction of disease-associated mutations are expected to specifically affect protein-protein interactions27,28. To explore the relationships between mutations and their associated disorders, we investigated positions of the disease-associated mutations with regard to interaction interfaces on the corresponding proteins. Among the 12,059 in-frame mutations, we found that 7,833 are located on interaction interfaces, a significant enrichment relative to the fraction of protein length covered by interfaces (Odds ratio = 2.1, P < 10^−20 with a Z-test; Fig. 2a). In contrast, no enrichment of in-frame mutations was detected in other non-interacting domains (Odds ratio = 1.0, P = 0.70 with a Z-test; Fig. 2a). This indicates that specific alteration (disruption or enhancement; Supplementary Note 4) of protein-protein interactions plays an important role in the pathogenesis of many disease genes, more than previously expected27 (Supplementary Note 5). On the other hand, truncating mutations seem to be distributed randomly throughout the protein (Fig. 2b). We also examined the distribution of 13,783 non-synonymous single nucleotide polymorphisms (SNPs)29 in 806 disease genes in hSIN and found that they too are randomly distributed (Fig. 2c and Supplementary Note 6). These results further support our conclusion because alleles with truncating mutations are more likely to produce non-functional products26 and most SNPs in dbSNP are considered to be non-disease-related30.
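One way to quantify this kind of enrichment is sketched below. It assumes that, under random placement, the share of mutations on interfaces should equal the share of residues on interfaces, and it uses a one-proportion z-test against that expectation. This is an assumed formulation for illustration; the exact statistic used in the study may differ, and the input numbers are hypothetical.

```python
from math import erfc, sqrt


def interface_enrichment(n_on_interface, n_total, interface_fraction):
    """Odds ratio and one-sided z-test for mutations falling on interfaces.

    interface_fraction is the share of residues that lie on interfaces,
    taken as the expected share of mutations under uniform placement.
    """
    observed = n_on_interface / n_total
    odds_ratio = (observed / (1 - observed)) / (
        interface_fraction / (1 - interface_fraction)
    )
    std_err = sqrt(interface_fraction * (1 - interface_fraction) / n_total)
    z = (observed - interface_fraction) / std_err
    p_value = 0.5 * erfc(z / sqrt(2))  # upper-tail normal probability
    return odds_ratio, z, p_value


# Hypothetical example: 60 of 100 mutations on interfaces that cover
# 40% of the protein length.
odds_ratio, z, p_value = interface_enrichment(60, 100, 0.4)
```

An odds ratio near 1 corresponds to mutations falling on interfaces in proportion to interface length, which is the pattern reported above for non-interacting domains.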
To verify that the in-frame mutations on the interfaces in hSIN can interfere with protein interactions, we manually compared them with an independent list of known interaction-altering missense mutations that could be mapped to genes in hSIN27. The majority (81%) of these mutations (72 mutations in total) are indeed localized on the interaction interfaces according to hSIN (Fig. 2d), confirming the coverage and quality of hSIN (Supplementary Note 7).
We also experimentally evaluated the effects of disease-associated mutations and non-disease-related SNPs found in MLH1, a well characterized human DNA mismatch repair gene frequently mutated in hereditary nonpolyposis colorectal cancer (HNPCC)31. MLH1 is known to interact with many proteins, including its heterodimeric partner PMS2, but the structural basis of most interactions, including with PMS2, still remains unknown. Our hSIN predicts that the HATPase_c domain and the DNA_mis_repair domain on MLH1 are potentially responsible for MLH1's interaction with PMS2 (Fig. 2e). Therefore we hypothesized that mutations within these two domains are likely to alter this interaction. To test our hypothesis, six different in-frame colorectal-cancer-associated mutations and three non-synonymous SNPs found in MLH1 were tested by Y2H for their abilities to alter the MLH1-PMS2 interaction (Supplementary Note 8 and Supplementary Fig. 1). Compared to the wild-type MLH1, only missense mutations (I68N, I107R, Y293D) within the predicted PMS2 interacting interface greatly reduce the MLH1-PMS2 interaction (Fig. 2f). These experimental results further confirm the validity of our predicted interaction interfaces in hSIN. Moreover, they show that in-frame mutations enriched on interfaces could indeed alter corresponding interactions.
Disease genes are often associated with multiple clinically distinct disorders2. To investigate how mutations in the same gene can cause different phenotypes, we examined the relationships between potentially interaction-altering in-frame disease-associated mutations within our atomic-resolution structural interaction network, hSIN.
By analyzing the distribution of in-frame mutation pairs on the same gene (Supplementary Note 9), we found that in-frame mutation pairs on different interaction interfaces are more than twice as likely to cause different disorders as those on the same interface (46% and 21% respectively, P < 10^−20 by a cumulative binomial test; Fig. 3a). This suggests that the numbers of interactions and interfaces are key to understanding the pleiotropic effects of disease genes. Mutations on interfaces of the same protein that mediate different interactions are more likely to cause distinct disruptions in the overall interactome and can therefore result in different biological consequences and lead to pleiotropic effects. Interestingly, there is no such difference between mutations in different non-interacting domains, further underscoring the importance of protein-protein interactions and their role in understanding disease.
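The pair-based comparison can be sketched as follows, assuming each in-frame mutation on a gene is annotated with the interface it falls on and the disorder it is associated with. The data and the function name are hypothetical; the study's exact pair-counting rules are given in its Supplementary Note 9.

```python
from itertools import combinations

# Hypothetical in-frame mutations on one gene: (interface, disorder).
mutations = [
    ("iface1", "disorderX"),
    ("iface1", "disorderX"),
    ("iface1", "disorderY"),
    ("iface2", "disorderY"),
    ("iface2", "disorderY"),
]


def different_disorder_rates(muts):
    """Fraction of mutation pairs with different disorders, split by
    whether the two mutations share an interface."""
    counts = {"same": [0, 0], "different": [0, 0]}  # [differing, total]
    for (if_a, dis_a), (if_b, dis_b) in combinations(muts, 2):
        key = "same" if if_a == if_b else "different"
        counts[key][1] += 1
        counts[key][0] += dis_a != dis_b
    return {k: (d / t if t else None) for k, (d, t) in counts.items()}


rates = different_disorder_rates(mutations)
```

In this toy gene, half of the same-interface pairs and two thirds of the different-interface pairs involve different disorders. In a full analysis the counts would be pooled over all genes before the two rates are compared.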
One well-studied example of pleiotropy is the Wiskott-Aldrich syndrome protein (WASP)32 (Fig. 3b). Mutations in this protein can give rise to three diseases: Wiskott-Aldrich syndrome (WAS), X-linked thrombocytopenia (XLT) or X-linked neutropenia (XLN). WAS and XLT are related diseases, with XLT being a milder form of WAS, and both are clinically distinct from XLN (Supplementary Note 10). Based on our 3D structural analysis using hSIN, mutations associated with WAS and XLT are in or around the WH1 domain, which is responsible for interaction with VASP; mutations for XLN, on the other hand, are all inside the PBD domain, which performs an entirely different function by interacting with CDC42 and regulating the auto-inhibition and potentially the localization of WASP33-35 (Fig. 3b). More interestingly, our experimental results confirm that mutations on different interfaces of WASP differ in how they alter protein interactions. Specifically, we compared interactions of CDC42 with the wild-type WASP and three disease-associated variants using Y2H. Neither of the two mutations located within the WH1 domain (R41G and E131K; associated with WAS/XLT) affects WASP's interaction with CDC42 (Fig. 3c, Lanes 3 and 4). However, we provide the first experimental evidence that one amino acid change within the PBD domain (I294T; associated with XLN) greatly reduces the WASP-CDC42 interaction (Fig. 3c, Lane 2). Previous in vitro analysis has shown that I294T increases WASP activity36. Our result suggests that I294T might function by disrupting the WASP-CDC42 interaction, thereby affecting WASP's regulation by CDC42.
Uncovering the mechanisms through which mutations in different genes can lead to the same disease is critical in finding novel disease-associated genes and ultimately understanding and treating the corresponding disease. In line with the widely accepted “guilt-by-association” principle, interacting proteins have been shown to tend to share similar functions and to cause the same disorders37. Earlier implementations of this idea had a significant impact and led to the determination of important disease associations for genes38. However, the fraction of successful predictions is still relatively small39. One main reason is that most interacting protein pairs only share a subset of their associated disorders.
To understand the underlying molecular mechanism for this phenomenon, we calculated the distribution of in-frame mutation pairs on two different proteins that cause the same disorder (Supplementary Note 9). We found, in agreement with previous studies2, that in-frame mutations on interacting proteins are generally much more likely to cause the same disorder (12%) than random expectation (0.17%, P < 10^−20 by a cumulative binomial test; Fig. 3d). More importantly, our results show that the likelihood for two in-frame mutations on the corresponding interfaces of the interacting proteins to cause the same disorder (14%) is significantly higher than that for two in-frame mutations on two interfaces not mediating their interaction (5.6%, P < 10^−20 by a cumulative binomial test; Fig. 3d). These results further indicate that alteration of specific interactions, caused by mutations on corresponding interfaces of two interacting proteins, plays an important role in the pathogenesis of the same disorder. An interesting example is the hemolytic uremic syndrome, which is associated with mutations on the corresponding interaction interfaces of both CFH and C3 that mediate the interaction between the two proteins40 (Supplementary Note 11 and Supplementary Fig. 2).
Our 3D structural analysis provides potential atomic-level understanding for some of the complex genotype-to-phenotype relationships. More importantly, these results enable us to generate a concrete molecular mechanism hypothesis for mutations of a certain disorder enriched on a specific interaction interface – they may cause their associated disorders via alteration of the interactions mediated by the corresponding interfaces (Fig. 4a, Supplementary Fig. 3 and Supplementary Note 4). Based on this proposed model, we can further predict new disease-associated genes (those that interact with known disease genes through the interfaces enriched with mutations associated with a certain disease; Supplementary Note 12 and Supplementary Fig. 4). Therefore, our analysis provides a much higher resolution application of the “guilt-by-association” principle. We then applied this principle to uncover unknown disease-associated genes using hSIN. For each disease, we selected proteins in hSIN that have at least 3 mutations associated with a certain disease and at least 1.5-fold enrichment on interaction interfaces (Online Methods and Supplementary Note 13). Other proteins interacting through the interfaces with enriched disease-specific mutations are predicted to be associated with the corresponding disease. In total, we predicted 292 new disease genes for 182 different diseases, representing 694 novel disease-to-gene associations. Using three-fold cross-validation, we confirmed that our structurally resolved interactome greatly improves the performance of predicting disease-associated genes, compared with existing interaction networks where proteins are modeled as simple graph-theoretical nodes (Supplementary Note 13 and Supplementary Figs. 5 and 6).
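The selection rule described above (at least 3 mutations for a disease and at least 1.5-fold enrichment on an interface) can be sketched as below. The sketch assumes that fold enrichment is the share of a protein's disease mutations on an interface divided by the share of the protein's length that the interface covers, and that enrichment is evaluated per interface. Both are assumptions made for illustration, and the data layout and function name are hypothetical; the study's actual criteria are given in its Online Methods and Supplementary Note 13.

```python
MIN_MUTATIONS = 3  # threshold stated in the text
MIN_FOLD = 1.5     # threshold stated in the text


def predict_candidates(proteins, known_genes):
    """Predict new genes for a single disease.

    proteins: dict name -> {"length": int, "n_mutations": int,
        "interfaces": [{"length": int, "n_mutations": int,
                        "partners": [str]}]}
    Returns partner genes reached through mutation-enriched interfaces
    that are not already known for the disease.
    """
    candidates = set()
    for info in proteins.values():
        if info["n_mutations"] < MIN_MUTATIONS:
            continue
        for iface in info["interfaces"]:
            # Assumed definition of fold enrichment (see lead-in).
            expected = iface["length"] / info["length"]
            observed = iface["n_mutations"] / info["n_mutations"]
            if observed / expected >= MIN_FOLD:
                candidates.update(iface["partners"])
    return candidates - set(known_genes)


# Hypothetical proteins for one disease.
toy_proteins = {
    "G1": {
        "length": 400,
        "n_mutations": 6,
        "interfaces": [
            {"length": 100, "n_mutations": 4, "partners": ["G2", "G3"]},
            {"length": 100, "n_mutations": 1, "partners": ["G4"]},
        ],
    },
    "G5": {
        "length": 300,
        "n_mutations": 2,
        "interfaces": [{"length": 50, "n_mutations": 2, "partners": ["G6"]}],
    },
}
new_genes = predict_candidates(toy_proteins, known_genes={"G1", "G3"})
```

Here only G2 is predicted: G3 is already known, G4 is reached through an interface that is not enriched, and G5 has too few mutations to qualify.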
To further experimentally validate our predictions, we examined the TP63-TP73 interaction. Unlike its paralog, the well-known tumor suppressor gene TP53, TP63 has an important role in epithelial development41. Sequence analysis suggested that TP63 mutations are responsible for Ankyloblepharon-ectodermal defect-cleft lip/palate (AEC) and Rapp-Hodgkin syndrome, two clinically similar disorders (Supplementary Note 14)42. Interestingly, most of the mutations cluster in the SAM2 domain of TP63. Based on the known co-crystal structure of the DGKD homodimer43, we predict that the SAM2 domain is potentially part of the interface for the TP63-TP73 interaction (Fig. 4b). Therefore, we hypothesized that mutations in the SAM2 domain could affect this interaction. We examined four mutations associated with AEC/Rapp-Hodgkin syndrome in the SAM2 domain (I549T, F565L, S580P, R594P) using Y2H. The protein expression levels of the mutants are comparable to that of the wild-type TP63 (Fig. 4c, middle panel). Our Y2H results indicate that all four mutations greatly reduce the TP63-TP73 interaction. This suggests that the disruption of proper binding between TP63 and TP73 might contribute to the observed phenotypes, and thus TP73 might also be involved in AEC/Rapp-Hodgkin syndrome.
From our 3D analysis of disease-associated mutations and their corresponding genes within the atomic-level structurally resolved human protein interactome, we find that specific alteration of protein interactions by in-frame mutations plays an important role in the pathogenesis of many disease genes. More importantly, our results show that the locations of the mutations with respect to the interaction interfaces are crucial in understanding the complex genotype-to-phenotype relationships, including pleiotropy and locus heterogeneity. All observations are demonstrated to be robust to the removal of random interactions and proteins as well as interaction, disease and domain hubs, potential biases that might be present in our datasets (Supplementary Note 15 and Supplementary Figs. 7-22). Furthermore, all observations remain the same when the calculations are repeated using only known domain-domain interactions from existing co-crystal structures (Supplementary Note 16 and Supplementary Fig. 22). Our findings are directly applicable to understanding molecular mechanisms of human genetic diseases and discovering new disease-associated genes and mutations both experimentally and computationally, which is of significant interest to both pharmaceutical and medical industries and especially important for treating diseases currently with undruggable target genes. To this end, we provide a list of novel disease-to-gene associations and generate many new hypotheses. Moreover, with the development of exome sequencing, many mutations are being discovered in every study44. It is difficult to determine their functional relevance experimentally all at once. Our analysis could potentially provide a novel approach to prioritize mutations discovered in large-scale sequencing projects, especially for protein pairs without known co-crystal structures.
The construction of our structurally resolved protein interactome largely relies on the availability of 3D co-crystal structures, which limits the coverage of our network. However, with the rapid growth of the PDB45, more co-crystal information will become available, and the same principles that we developed here can be readily applied to uncover potential molecular mechanisms of many more disease genes whose structural information is currently missing. Another limiting factor is that some interaction interfaces fall outside of the known domain structures, including the disordered regions46. Incorporating this type of information will further improve the coverage of hSIN. Moreover, other parts of the protein, especially regions immediately outside of the interacting domains we predicted, might also contribute to the interaction directly or contribute to the correct folding of the corresponding domains. For example, a previous study indicated that the SAM2 domain alone might not be sufficient for the TP63-TP73 interaction and suggested that residues upstream and downstream of the SAM2 domain and the P53_tetramer domain could also be involved in the interaction47. Accordingly, based on the known co-crystal structure of the TP53 homodimer48, we predicted in hSIN that the P53_tetramer domain of TP63 could also be part of the interface for this interaction.
Although we have shown that the interaction pairs in hSIN have significantly higher co-expression correlation and functional similarity in general, further studies can be carried out by considering gene expression under disease-specific conditions and/or within corresponding tissues for specific disorders. Moreover, study of changes in the protein-protein interaction network during disease progression can also assist the identification of disease biomarkers and modules49. In addition to genetic mutations, many other factors including environmental stress, epigenetic modifications and invasion of pathogens might also contribute to human clinical disorders50. Integrating these factors in the follow-up studies of the hypotheses generated by our analysis will likely expand our understanding of many human genetic disorders in the near future.