An essential objective of genome-scale sequencing and functional genomics is to remedy the paucity of known associations between genetic variations and human diseases or other phenotypic traits (such as birth weight), and to clarify the impact of epigenetic modifications. Building on these upstream genetic causes, the ultimate goal is an improved form of personalized medicine based on an individual patient's genetic variation.1
Genetic disorders are often categorized as single-gene diseases or as complex, multi-gene diseases such as cancers and diabetes. Typical examples of single-gene diseases are those of Mendelian inheritance, caused by mutations in an individual gene that result in an altered function or loss of its ability to properly interact with other genes.2
In contrast, complex diseases arise from the interplay of many different genes and single-nucleotide polymorphisms (SNPs). Although many of these diseases are common, their driving genetic mechanisms remain poorly understood on a molecular level.
To pinpoint the genes involved in complex diseases and elucidate their underlying genetic variations, hundreds of genome-wide association studies (GWAS) have been carried out to compare affected individuals with control cohorts. Although many disease alleles have been discovered, most have only a small effect size (odds ratio, OR < 1.5).4
Therefore, it is unlikely that a few SNPs alone give rise to complex diseases; it is more probable that an accumulation of large combinations of SNPs and other forms of genetic variation disrupts key biological mechanisms and consequently alters normal human physiology.5
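As a worked illustration of the effect-size threshold mentioned above, the odds ratio compares the odds of carrying a risk allele among cases with the odds among controls. The sketch below is a minimal example under stated assumptions: the counts are hypothetical (not taken from any study), and the helper name `odds_ratio` is introduced here only for illustration.

```python
def odds_ratio(case_carriers, case_noncarriers,
               control_carriers, control_noncarriers):
    """Odds ratio from a 2x2 table of allele carriers vs. disease status."""
    case_odds = case_carriers / case_noncarriers
    control_odds = control_carriers / control_noncarriers
    return case_odds / control_odds


# Hypothetical counts, chosen only to illustrate a small effect size.
or_value = odds_ratio(case_carriers=300, case_noncarriers=700,
                      control_carriers=250, control_noncarriers=750)

print(f"OR = {or_value:.2f}")
print("small effect (OR < 1.5):", or_value < 1.5)
```

With counts like these, the allele is clearly enriched among cases, yet its odds ratio stays below 1.5, which is the situation the text describes for most GWAS-discovered alleles.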
Since the clinical functions of numerous intragenic trait-associated SNPs (SNPs located within gene regions) remain uncharacterized, the genetic architectures within and between these traits are also poorly understood. For instance, obesity often contributes fundamentally to many other diseases, such as diabetes and hypertension, and indeed obesity-associated genes have been prioritized in adult-onset diabetes GWAS.6
We therefore hypothesize that there should theoretically be some core shared SNPs, genes, or biological pathways that contribute to or cause common underlying traits. Such genetic architecture is evident in cancer, where gain-of-function mutations in oncogenes occur in the same genes across distinct cancers. Furthermore, these central genetic architectures can contribute to and link the diseases found within a particular metatrait, defined as a class of disorders clinically related in time (eg, one disease causally precedes another) or sharing common molecular functions and processes. For example, oncogenic processes leading to ‘cancer’ can together be considered a metatrait that comprises different types of specific cancers, as their somatic mutations often overlap genetically or functionally. Similarly, metabolic syndrome is considered a metatrait that includes insulin resistance, hypertension, obesity, hyperlipidemia, and hypercholesterolemia. Therefore, to elucidate this genetic underpinning, great emphasis is placed on shared characteristics, such as symptoms and drug responses, when developing disease networks8 and their causative genetic networks.
Numerous methods have been developed to construct human disease networks and can be categorized as either non-SNP-based or SNP-based methods. Information in electronic medical records, such as disease correlation or comorbidities, can be directly applied to construct disease networks.10
Furthermore, underlying biological disease data, such as mRNA expression profiles9 and protein–protein interactions (including protein complexes),11 can also be employed to infer disease networks. Additionally, metabolic data, such as adjacent or mutual biochemical reactions, have been used in disease network development.15
Recently, with the dramatic increase in genetic variation data and GWAS results, shared intragenic SNPs and their host genes (the genes physically containing the variations)16 have been used to link distinct diseases, both single-gene inherited diseases17 and complex diseases.19
However, unlike networks of single-gene diseases, these early complex-disease network studies, which relied on simple SNP and gene overlaps, have not produced the expected modularization (related diseases highly connected with each other) because of small dataset sizes. Specifically, many diseases were found to be isolated and totally disconnected from other diseases within the same disease class.19
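The overlap-based construction discussed above can be sketched in a few lines, which also shows how isolated diseases arise. This is an illustrative sketch only, not the procedure of any cited study: the trait names, the gene identifiers, and the helper names `overlap_edges` and `isolated_traits` are all hypothetical.

```python
from itertools import combinations

# Hypothetical trait -> host-gene sets (placeholder identifiers, not real data).
trait_genes = {
    "trait_A": {"GENE_1", "GENE_2", "GENE_3"},
    "trait_B": {"GENE_2", "GENE_4"},
    "trait_C": {"GENE_5", "GENE_6"},
}


def overlap_edges(trait_genes):
    """Link two traits only if they share at least one host gene."""
    edges = {}
    for a, b in combinations(sorted(trait_genes), 2):
        shared = trait_genes[a] & trait_genes[b]
        if shared:
            edges[(a, b)] = shared
    return edges


def isolated_traits(trait_genes, edges):
    """Traits that appear in no edge of the overlap network."""
    connected = {trait for pair in edges for trait in pair}
    return sorted(set(trait_genes) - connected)


edges = overlap_edges(trait_genes)
print("edges:", edges)
print("isolated:", isolated_traits(trait_genes, edges))
```

In this toy network, a trait with no gene in common with any other trait is left disconnected regardless of how closely it may be related clinically, which is the limitation the text attributes to simple overlap methods.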
Since previous disease network modeling methods have been based mainly on gene overlap or clinical relatedness, as found in GWAS or medical records, rather than on biological relatedness, only those diseases with obvious genetic or clinical connections have been highlighted. Furthermore, the majority of these networks used Mendelian inheritance data from the Online Mendelian Inheritance in Man (OMIM) database rather than complex inheritance patterns from GWAS. Many diseases are clearly clinically related, eg, through the comorbidity between hypertension and obesity, yet no overlapping genes or SNPs have been discovered for them by GWAS to date. Therefore, more sophisticated ways of relating two diseases and constructing disease networks must be designed in order to understand their common pathologies and relatedness, which can further our ability to treat diseases.
One application of such networks would be drug repositioning, accomplished by identifying biological mechanisms shared between a treatable disease and another for which no effective treatment exists. This can be conducted using network-theoretic models in which biological mechanisms are used to relate diseases and their associated molecular structures. Novel methodologies that can mechanistically relate diseases that are observably clinically related but have little or no shared genetic or physiological underpinning may elucidate more complex mechanisms and point to therapies that can be repurposed between the two.
To address this issue, we propose a novel method that builds disease–disease networks that extend well beyond mere shared SNPs or host gene linkages. To this end, we exploit the semantic similarity among host genes of validated trait-associated SNPs in the National Human Genome Research Institute Catalog of Published Genome-Wide Association Studies (NHGRI GWAS Catalog)20 via existing annotations of host genes in Gene Ontology (GO)21 to build a similarity network of diseases with an information theory-based approach. Specifically, our method integrates genetic alteration (GWAS) data with standardized textual descriptions of gene functions and processes and their inter-relationships in order to characterize the mechanistic underpinnings of diseases of complex inheritance. We hypothesize that similarity among clinically related diseases is reflected in the similarity between their constitutive deregulated processes and functions, which can be investigated computationally through the biological annotations associated with the genes hosting their intragenic GWAS SNPs (host genes). Thus, we use a novel application of GO to computationally examine and compare data derived from GWAS.
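To make the idea of an information theory-based similarity over ontology annotations more concrete, the sketch below computes a Resnik-style score on a toy ontology. This is an assumption made for illustration: this section does not specify the exact formula used, so a generic information-content measure with a best-match average is swapped in. The term labels, gene identifiers, annotations, and function names are all hypothetical and are not real GO identifiers or GWAS data.

```python
import math

# Hypothetical toy ontology: term -> set of parent terms (not real GO terms).
PARENTS = {
    "root": set(),
    "metabolic process": {"root"},
    "signaling": {"root"},
    "lipid metabolism": {"metabolic process"},
    "glucose metabolism": {"metabolic process"},
    "insulin signaling": {"signaling"},
}

# Hypothetical host gene -> directly annotated terms.
ANNOTATIONS = {
    "GENE_1": {"lipid metabolism"},
    "GENE_2": {"glucose metabolism"},
    "GENE_3": {"insulin signaling"},
    "GENE_4": {"glucose metabolism", "insulin signaling"},
}


def ancestors(term):
    """Return the term together with all of its ancestors."""
    result = {term}
    for parent in PARENTS[term]:
        result |= ancestors(parent)
    return result


def information_content():
    """IC(t) = -log p(t), where p(t) is the fraction of genes annotated
    to t directly or through one of its descendants."""
    counts = {term: 0 for term in PARENTS}
    for terms in ANNOTATIONS.values():
        covered = set()
        for term in terms:
            covered |= ancestors(term)
        for term in covered:
            counts[term] += 1
    total = len(ANNOTATIONS)
    return {t: -math.log(c / total) for t, c in counts.items() if c > 0}


IC = information_content()


def term_similarity(t1, t2):
    """Resnik-style score: IC of the most informative common ancestor."""
    return max(IC[t] for t in ancestors(t1) & ancestors(t2))


def gene_similarity(g1, g2):
    """Best-scoring pair of terms between two genes."""
    return max(term_similarity(a, b)
               for a in ANNOTATIONS[g1] for b in ANNOTATIONS[g2])


def disease_similarity(genes_a, genes_b):
    """Best-match average over the host genes of two diseases."""
    best_ab = [max(gene_similarity(a, b) for b in genes_b) for a in genes_a]
    best_ba = [max(gene_similarity(a, b) for a in genes_a) for b in genes_b]
    return (sum(best_ab) + sum(best_ba)) / (len(best_ab) + len(best_ba))


# Two hypothetical traits with no host gene in common.
print(disease_similarity(["GENE_1"], ["GENE_2"]))
print(disease_similarity(["GENE_1"], ["GENE_3"]))
```

In this toy setting, two traits that share no host gene still receive a nonzero similarity when their genes are annotated under a common, informative parent term, whereas traits whose genes meet only at the root score zero. That is the kind of linkage a pure overlap network cannot express.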
We further analyze our disease–disease network using protein interactions to create a disease–gene network (which contains disease–disease, gene–gene, and disease–gene connections). Integrating protein interaction data allows us to distinguish functional similarity relationships that can be explained straightforwardly at the protein level from those that are most likely due to higher-scale biological processes (eg, cell proliferation associated with cancer). Further, this enhances the current paradigm of ‘targeted’ therapy repositioning, which implies a ‘protein target’ and is thus better understood at the protein level, with non-trivial and multi-scale biological mechanisms unveiled by similarity metrics (GWAS/SNP/GO). We have previously demonstrated that this gene information theoretic similarity (ITS) method can accurately predict protein functions in poorly characterized genes22 and, further, can exploit the shared genetic architecture of diseases by using their common interactions or interaction paths. Thus, we hypothesize that this sensitive similarity approach could allow the elucidation of non-trivial associations between trait-associated genes. Additionally, we constructed our network based on a much larger number of NHGRI intragenic SNPs than previous studies.