The immense corpus of biomedical literature existing today poses challenges in information search and integration. Many links between pieces of knowledge occur or are significant only under certain contexts—rather than under the entire corpus. This study proposes using networks of ontology concepts, linked based on their co-occurrences in annotations of abstracts of biomedical literature and descriptions of experiments, to draw conclusions based on context-specific queries and to better integrate existing knowledge. In particular, a Bayesian network framework is constructed to allow for the linking of related terms from two biomedical ontologies under the queried context concept. Edges in such a Bayesian network allow associations between biomedical concepts to be quantified and inference to be made about the existence of some concepts given prior information about others. This approach could potentially be a powerful inferential tool for context-specific queries, applicable to ontologies in other fields as well.
The millions of published works of biomedical literature cover an enormous array of knowledge. Over 21 million articles are indexed in PubMed alone, and around 700,000 new articles are added yearly1. Additionally, data from millions of experiments are archived in diverse databases. The large size of today’s body of biomedical knowledge and swiftness with which new information is being added present challenges in organization and navigation. The rise of such a large amount of information in recent years is changing the nature of biological knowledge from a descriptive practice to a more data-driven one, and finding specific information through manual search is growing increasingly difficult.
Biomedical ontologies can potentially be used to address these challenges. Tremendous efforts have been made to create diverse ontologies that together include all biomedical concepts. The National Center for Biomedical Ontology (NCBO) BioPortal2,3 provides over 250 such ontologies with over 5 million concepts6 to researchers. Moreover, researchers use ontology terms to annotate experimental data and works of literature. An automated, efficient framework that navigates and integrates the information embedded in these ontological links would therefore be a powerful research tool that draws on an immense range of biomedical knowledge. However, ontologies are usually developed in silos, and this separateness has so far hindered the practical application of ontological organization. A crucial question thus remains unanswered: is it possible to automatically and efficiently use biomedical ontologies to infer new knowledge?
This work presents such an automated framework that integrates biomedical ontologies and infers knowledge from abstracts of literature and descriptions of experimental data in response to a user-defined query. In particular, this framework infers information particular to a given context, or situation. Context-specificity is useful because researchers often have questions relevant to specific situations, and the same biological concepts may be linked in some contexts but not in others. For example, two traits might not generally be observed together, but in the context of a specific genetic condition, they may coexist frequently. The proposed framework identifies these types of linkages.
A logical first step is to integrate the disparate biomedical ontologies. We seek a reliable framework for mapping ontological relationships that (1) considers diverse types of relationships between terms, (2) accounts for uncertainty in ontology integration, (3) is scalable to the size of biomedical ontologies, and (4) is able to be tailored to specific contexts. So far, no such framework has been developed.
The Unified Medical Language System (UMLS)4 has integrated over 2 million names for approximately 900,000 biological concepts. However, mappings of UMLS concepts were manually curated, so there remain inconsistencies and errors in the mappings, and it is difficult for mappings to keep pace with the rate at which knowledge is expanding5. Many non-manual methodologies exist for ontological integration, including semi-automatic methods such as PROMPT7 and GLUE8, and automatic methods such as IF-MAP9, ANCHOR-PROMPT10, and MAFRA11. In Chua et al.12, more than 30 ontology mapping methods are surveyed and categorized into 7 categories. However, almost all proposed methods are not publicly available or are not scalable to the size of biomedical ontologies13.
Two recent methods, Association Rule Ontology Matching Approach (AROMA)14 and Lexical OWL Ontology Matcher (LOOM)13, are publicly available and easily scalable. These two methods differ significantly. LOOM is used for discovering equivalence correspondences between concepts, is based on lexical matching, and does not require text corpora to work. In contrast, AROMA is used for inferring subsumption relationships between concepts, is based on a statistical measure known as implication intensity, and requires additional text corpora. Though these methods are steps forward, they do not consider the inevitable uncertainty of ontology mapping.
Some ontology-mapping studies do consider uncertainty, incorporating it probabilistically into their description logic by using Bayesian networks15–19. For example, a framework called OMEN16 creates Bayesian networks of ontologies by drawing initial probabilities from a priori knowledge and then using a set of meta-rules to determine conditional probabilities between nodes. The conditional probabilities represent influences induced by nodes on their children. Two other algorithms, MSBN17 and AEBN18, create pairwise correspondences between semantically identical concepts and propagate information through these correspondences between two ontology-specific Bayesian networks. The algorithm BayesOWL15 uses a process similar to those of MSBN and AEBN but is more comprehensive: it links similar concepts as well as identical concepts by defining the similarity of concepts probabilistically by their joint distribution. More methods for probabilistic modeling of uncertainty in linking ontologies can be found in Lukasiewicz20.
The method presented in this paper is distinct from the aforementioned ontology-mapping methods in its use of a context-sensitive algorithm. In prior work, a context-specific mapping algorithm based on the Bayes factor21 was developed. This study adapts and applies that mapping method to construct the backbones of context-centered Bayesian networks for inference about biomedical relationships. This context-specific mapping approach has three main advantages: (1) mappings created are specific to the question under investigation, so unrelated concepts are pruned; (2) inference is less prone to noise generated from considering many unrelated concepts and can be more accurate; and (3) pruning many irrelevant concepts allows the inference algorithm to be scaled to the large size of most biomedical ontologies.
Once the Bayesian backbone is constructed, probabilistic inference on the framework accounts for conditional uncertainties in biological connections in the given context and gives more nuanced conclusions. The prior study21 focused primarily on gathering the literature base and developing the Bayes factor to conduct univariate linkage analysis between terms; here, the Bayes factor is used as a tool to consider multivariate relationships in a high-dimensional Bayesian network, leading to more meaningful results.
The proposed framework constructs and analyzes networks based on knowledge embedded in ontological annotations of descriptions of experimental data and abstracts of published literature. After obtaining an annotated knowledge database, the framework comprises three main stages: (1) defining the query, (2) constructing a Bayesian graph based on that query, and (3) using that graph to perform probabilistic inference.
We prepared an indexed B-tree for searching the knowledge base that comprised annotated records from eleven corpora available from NCBO BioPortal in 2009 (Table 1). Then, 220 ontologies were obtained from the NCBO BioPortal; to cache sufficient statistics when searching through the literature, the dictionary of all available ontology concepts (4,153,358 terms) was obtained. More details on preparation of this B-tree and the ontology data are provided in Kshitij et al.21
In this work, a query consists of a concept of interest (the context under which linkages are identified) and two ontologies (containing the terms between which linkages are drawn). One ontology is designated the source ontology; the other is designated the destination ontology. Users define elements of their queries based on their applications. For example, a researcher interested in obesity-related phenotypes and genes might choose “obesity” for the context and Human Phenotype Ontology and Gene Ontology for the two ontologies.
Based on the query, a tree-augmented naïve (TAN) Bayesian network is constructed, where each node is a random variable that represents the state a specific concept takes in an annotation (either “exists” or “does not exist”). Nodes corresponding to concepts from the source ontology are the parents of nodes corresponding to concepts from the destination ontology. The root node corresponds to the context concept and is a parent of all nodes in the network. The TAN structure is adopted because its requirement that the root node is a parent of every other node parallels the way the context term is present when every ontological connection is identified. The structure is appropriate for context-specific inference.
The time complexity of learning a TAN structure from data using a maximally weighted spanning tree algorithm25 is O(n²N), where n is the number of features (the number of concepts in both the source and the destination ontologies), and N is the number of samples26. In this study, the data are a large collection of literature annotated with ontology concepts. However, the large size of biomedical ontologies renders the original TAN structure-learning algorithm25 infeasible. Hence, this framework uses a Bayes factor23 (BF) to identify linkages between any source concept S and any destination concept D under a given context concept C7. The higher the Bayes factor, the stronger the association between two random variables. Therefore, we prune weak linkages between S and D under C and map only S and D concepts whose BF exceeds a threshold value.
In order to calculate the BF between any S and D considering the context C, we first create a 2-by-2 contingency table. Each element of this table is determined from the frequencies of co-occurrences of S and D in literature that contain C. Let n be the number of documents, its subscript (S, D, or C) be the type of concept being counted, and the superscript (+ or −) be the state of the concept, where a plus sign (+) signifies “exists” and a minus sign (−) signifies “does not exist.” The contingency table contains n++, n+−, n−+, and n− −. Counts are obtained through full-text searches of the knowledge database (Table 1) and are used to calculate BF using the procedure described in Albert27. However, BF is not calculated for every pair of S and D: the hierarchical structure of ontologies allows a more efficient depth-first branch-and-bound algorithm28 to be used to traverse the two ontologies.
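The counting step and a Bayes factor computed from the resulting table can be sketched as follows. This is a minimal illustration, not the procedure of Albert27: it assumes multinomial sampling with uniform Dirichlet and Beta priors, and the function names and the representation of annotated records as sets of concept labels are hypothetical.

```python
from math import lgamma

def contingency_counts(docs, source, dest, context):
    """Count states of source and dest among records that mention the context.

    docs is an iterable of sets of concept labels (one set per annotated record).
    Returns (n_pp, n_pm, n_mp, n_mm): source/dest present (+) or absent (-).
    """
    n_pp = n_pm = n_mp = n_mm = 0
    for concepts in docs:
        if context not in concepts:
            continue  # only records annotated with the context are counted
        s, d = source in concepts, dest in concepts
        if s and d:
            n_pp += 1
        elif s:
            n_pm += 1
        elif d:
            n_mp += 1
        else:
            n_mm += 1
    return n_pp, n_pm, n_mp, n_mm

def log_bayes_factor(n_pp, n_pm, n_mp, n_mm):
    """Log Bayes factor of 'associated' versus 'independent' for a 2x2 table.

    Assumption (not from the source): uniform Dirichlet(1,1,1,1) prior on the
    four cells for the associated model, Beta(1,1) priors on both margins for
    the independent model.
    """
    total = n_pp + n_pm + n_mp + n_mm
    log_m1 = (lgamma(4) + sum(lgamma(n + 1) for n in (n_pp, n_pm, n_mp, n_mm))
              - lgamma(total + 4))
    row, col = n_pp + n_pm, n_pp + n_mp
    log_m0 = (lgamma(row + 1) + lgamma(total - row + 1) - lgamma(total + 2)
              + lgamma(col + 1) + lgamma(total - col + 1) - lgamma(total + 2))
    return log_m1 - log_m0
```

Working on the log scale avoids overflow when counts are large; a positive value favors association between S and D under C.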
After all significantly co-occurring pairs of S and D under the context C are linked, every concept in the network is also linked to the context concept, in accordance with the TAN structure. The same destination concept may appear several times, each time linked to a different source concept, because of the TAN requirement that nodes have no more than one non-root parent. The different instances of the same destination concept are not merged into one node because keeping them separate greatly simplifies probabilistic inference.
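The structure described above can be sketched as an edge list. In this illustrative example, the function name, the dictionary of precomputed Bayes factors, and the choice to key each destination instance by its source parent are all assumptions made for clarity.

```python
def build_context_network(context, scored_pairs, threshold):
    """Return (nodes, edges) for a context-rooted, three-tier network.

    scored_pairs maps (source, destination) -> Bayes factor under the context.
    Destination nodes are keyed by (destination, source), so a destination
    linked to several sources appears once per source and keeps exactly one
    non-root parent.
    """
    nodes = [context]
    edges = []
    for (source, dest), bf in sorted(scored_pairs.items()):
        if bf <= threshold:
            continue  # prune weak links
        if source not in nodes:
            nodes.append(source)
            edges.append((context, source))  # root is a parent of every source
        dest_node = (dest, source)  # separate instance per source parent
        nodes.append(dest_node)
        edges.append((source, dest_node))
        edges.append((context, dest_node))  # root is a parent of every leaf
    return nodes, edges
```

With pairs `{("s1", "d1"): 12.0, ("s2", "d1"): 8.0, ("s2", "d2"): 0.5}` and a threshold of 3.0, the destination `d1` is kept twice, once under each source, and the weak `s2`-`d2` link is dropped.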
The network must next be associated with probabilities. For each concept in the network, a conditional probability table is determined that gives the probability of each of its states (“exists” or “does not exist”) for every combination of states its parent nodes can take. The conditional probability values are derived from the counts of different combinations of the states of the concept in question and its parents in annotations. For example, P(C+) = nC+/(nC+ + nC−) is one context probability value, P(S+|C+) = nS+C+/(nS+C+ + nS−C+) is one source probability value, and P(D+|S+C+) = nD+S+C+/(nD+S+C+ + nD−S+C+) is one destination probability value. Queries are performed and counts are collected in the same way as when calculating the Bayes factor to build the network structure. Based on transitive closure of concepts in ontologies, we used the same depth-first branch-and-bound procedure described in Kshitij et al.21 to prune the ontologies and cache the statistics. This pruning makes the Bayesian network construction efficient and scalable to the size of biomedical ontologies.
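The relative-frequency estimates above can be illustrated with a short sketch. The function name and the representation of annotated records as sets of concept labels are hypothetical, and no smoothing is applied, so the estimate is undefined when no record matches the parent states.

```python
def conditional_probability(docs, target, given_present=(), given_absent=()):
    """Estimate P(target exists | parent states) as a relative frequency.

    docs is an iterable of sets of concept labels. Returns None when no record
    matches the requested parent states.
    """
    matching = [
        concepts for concepts in docs
        if all(p in concepts for p in given_present)
        and not any(a in concepts for a in given_absent)
    ]
    if not matching:
        return None
    return sum(target in concepts for concepts in matching) / len(matching)
```

For instance, `conditional_probability(docs, "D", given_present=("S", "C"))` corresponds to P(D+|S+C+), and `conditional_probability(docs, "D", given_present=("S",), given_absent=("C",))` corresponds to P(D+|S+C−).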
The final product is thus a three-tiered TAN Bayesian network with the context term at the root, source ontology terms as intermediates, and destination ontology terms as the leaves, related to one another by conditional probabilities based on the frequencies of their co-occurrence in annotations of literature and of experimental data.
The power of these networks lies in Bayesian inference. Because nodes are linked by probabilities, predictions can be made about the states of any of the other nodes given the prior probability distribution of the root nodes. Pearl’s message-passing algorithm29 is implemented so that state information about one or more nodes can propagate along the graph edges and influence the probabilities of the states of other nodes. For example, if certain biological concepts are known to be affected, expressed, or active, the nodes corresponding to those concepts are set to true (P(exists) = 1). The tree is then updated to reflect this new knowledge, and P(exists) values of all other nodes change accordingly.
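The effect of clamping a node can be illustrated on a three-node fragment (context C, one source S, one destination D). The sketch below uses brute-force enumeration of the joint distribution, not Pearl's message passing; on a tree both give the same posteriors, but enumeration is only practical for tiny networks. All names and probability values are illustrative assumptions.

```python
from itertools import product

def joint(c, s, d, p_c, p_s_given_c, p_d_given_sc):
    """Joint probability of one full assignment of (C, S, D) states."""
    pc = p_c if c else 1 - p_c
    ps = p_s_given_c[c] if s else 1 - p_s_given_c[c]
    pd = p_d_given_sc[(s, c)] if d else 1 - p_d_given_sc[(s, c)]
    return pc * ps * pd

def posterior_exists(query, evidence, p_c, p_s_given_c, p_d_given_sc):
    """P(query node exists | evidence) by summing the joint over all states."""
    names = ("C", "S", "D")
    num = den = 0.0
    for states in product((True, False), repeat=3):
        assignment = dict(zip(names, states))
        if any(assignment[k] != v for k, v in evidence.items()):
            continue  # inconsistent with the clamped nodes
        p = joint(*states, p_c, p_s_given_c, p_d_given_sc)
        den += p
        if assignment[query]:
            num += p
    return num / den

# Illustrative (made-up) parameters.
P_C = 0.3
P_S_GIVEN_C = {True: 0.6, False: 0.1}
P_D_GIVEN_SC = {(True, True): 0.8, (True, False): 0.4,
                (False, True): 0.2, (False, False): 0.05}
```

With these values, `posterior_exists("S", {}, P_C, P_S_GIVEN_C, P_D_GIVEN_SC)` is the prior belief in S, and passing `{"D": True}` as evidence raises it, mirroring how setting one node to "exists" updates the others.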
One application of the constructed networks and the proposed inference algorithm is the identification of source or destination concepts in the network that are related to the context C. To measure the relatedness of a term T to the context, we associate it with a likelihood ratio L:

\[ L = \frac{P(T \text{ exists} \mid C \text{ exists})}{P(T \text{ exists} \mid C \text{ does not exist})} \tag{1} \]
To calculate L, the state of the context node is set to “exists,” and the states of all other nodes are left unknown. Beliefs are then propagated, and P(T exists|C exists) is found for each node. The context node is then set to “does not exist,” and the other nodes are still left with unknown states. Again, beliefs are propagated, and P(T exists|C does not exist) is found for each node. L is the ratio of those two probabilities. It measures how much more likely it is that T is true when C is true than when C is false, not simply how likely it is that the two terms coexist. Hence, a general term such as “disease/disorder” would not score a high L because there would be little difference in the probability that it exists whether or not the context is true. That is, the terms with the highest L are most likely to be related specifically to the context and are therefore terms of interest.
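For a destination term T with parents S and C, the two probabilities in L can be obtained by marginalizing out S. The sketch below does this in closed form for a single source parent; the function name and the numbers in the usage note are illustrative assumptions.

```python
def likelihood_ratio(p_s_given_c, p_t_given_sc):
    """L = P(T exists | C exists) / P(T exists | C does not exist).

    p_s_given_c maps the context state to P(S exists | C).
    p_t_given_sc maps (source state, context state) to P(T exists | S, C).
    The source node S is marginalized out for each context state.
    """
    def p_t_given(c):
        ps = p_s_given_c[c]
        return ps * p_t_given_sc[(True, c)] + (1 - ps) * p_t_given_sc[(False, c)]
    return p_t_given(True) / p_t_given(False)
```

A term whose probability barely depends on the context, for example one with P(T exists | S, C) = 0.9 in every parent state, yields L = 1, whereas a term with probabilities of 0.8, 0.4, 0.2, and 0.05 across the four parent states (and P(S exists | C) of 0.6 and 0.1) yields L well above 1. This matches the point that general terms do not score highly.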
A p-value can also be found for each T-C link. First, the Bayes factor that T and C are associated is determined again in the manner described in Section 4.2, except this time the contingency table contains nT+C+, nT+C−, nT−C+, and nT−C−. Using that BF, an upper bound for the p-value can be determined as follows36, where p < 1/e:
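The bound itself is not reproduced in this text. One well-known calibration that carries the same p < 1/e condition is the −e·p·ln(p) bound of Sellke, Bayarri, and Berger, under which a Bayes factor in favor of association satisfies BF ≤ 1/(−e·p·ln p). The sketch below assumes, without confirmation from the source, that this is the intended relation, and inverts it numerically by bisection; the function name is hypothetical.

```python
from math import e, log

def p_value_upper_bound(bayes_factor, tol=1e-12):
    """Largest p in (0, 1/e) satisfying -e * p * ln(p) <= 1 / bayes_factor.

    Assumption: BF <= 1 / (-e * p * ln p) for evidence of association.
    Returns 1/e when the Bayes factor is at most 1 (the bound is uninformative).
    """
    if bayes_factor <= 1:
        return 1 / e
    target = 1 / bayes_factor
    lo, hi = 0.0, 1 / e  # -e*p*ln(p) increases from 0 to 1 on this interval
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if -e * mid * log(mid) < target:
            lo = mid
        else:
            hi = mid
    return hi
```

Under this assumed relation, larger Bayes factors map to smaller p-value bounds, and a Bayes factor of roughly 2.46 corresponds to a bound of 0.05.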
This study examines the terms with the highest L as the “most related” and then separately associates them with p-values for ease of understanding. This choice was made because L, which is based on probabilistic propagation over the network, considers all terms in a high-dimensional joint distribution, whereas using Bayes factor or p-value as the final mapping is essentially a univariate, deterministic linkage from source to context. Without using L, the benefits of considering intricate, multivariate biomedical relationships represented in the network would be lost.
Inference using the Bayesian framework is not limited to identification of connections between ontological concepts. For example, the framework can be used to identify the genes most relevant to a queried context. To do so, networks are built with Gene Ontology (GO) as one of the two queried ontologies, and GO concepts in the network are linked with relevant genes based on gene set information from MSigDb46. Because links between genes and GO concepts are deterministic, this additional gene level is not actually part of the probabilistic inference framework, and genes do not correspond to network nodes. Therefore, the inference procedure for finding genes relevant to a given context cannot rely on belief propagation. Instead, the relatedness of genes to the context is determined based on network structure alone. For each gene, a one-sided Fisher’s exact test is used to determine whether there is a significant difference between the proportion of GO terms in the network (which are presumably related to the context) that are associated with that gene and the proportion of GO terms outside the network (which are conversely not strongly related to the context) that are associated with the gene. The Benjamini-Hochberg method was used to control the false discovery rate across multiple hypotheses47.
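The gene-level test can be sketched in pure Python. This is a minimal illustration under stated assumptions: the function names and argument layout are hypothetical, the one-sided test is taken to be enrichment inside the network, and the counts in the usage note are made up.

```python
from math import comb

def fisher_one_sided(in_net_hit, in_net_total, out_net_hit, out_net_total):
    """One-sided Fisher's exact p-value for enrichment inside the network.

    Computes P(X >= in_net_hit) under the hypergeometric null, where X is the
    number of in-network GO terms associated with the gene.
    """
    hits = in_net_hit + out_net_hit
    total = in_net_total + out_net_total
    denom = comb(total, in_net_total)
    upper = min(hits, in_net_total)
    return sum(
        comb(hits, k) * comb(total - hits, in_net_total - k)
        for k in range(in_net_hit, upper + 1)
    ) / denom

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

For a gene associated with all 3 of 3 in-network GO terms and none of 3 out-of-network terms, `fisher_one_sided(3, 3, 0, 3)` is 1/20. The per-gene p-values would then be passed together to `benjamini_hochberg` before thresholding.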
In this work, we use the proposed learning network structure and inference procedures to identify diseases and genes related to specific pathologies of interest. The context concept is set to be the pathology, the source ontology is set to be Human Disease (DOID), and the destination ontology is set to be Gene Ontology (GO). Both ontologies are available through NCBO BioPortal2,3. That way, all biomedical relationships represented in the Bayesian network are specific to the context or pathology of interest. Since, as explained in Section 4, inference is done over this context-specific network, all identified relationships are tailored for the context.
Networks were constructed and analyzed for diverse context pathologies, including several cancers, substance abuse disorders, obesity and heart disease, and HIV/AIDS. Similar patterns were observed in the results of all contexts; results from contexts “alcoholism” and “obesity” are discussed further in this paper. The network built using context “obesity” is shown in Figure 2. In the rest of Section 5, italicized body text represents terms or genes identified by the algorithm as associated with the context.
The disease concepts with the strongest links to the context “alcoholism” using (1) as a measure of link strength are indeed closely biologically related to alcoholism (Table 2). Alcoholism is a substance-related disorder, is a form of addiction, and would be associated with alcohol-related disorders NOS. Alcohol consumption is known to interfere with the nervous system, leading to impaired perception, coordination, memory, and judgment, all possible components of organic mental disorder of unknown etiology38,39. Psychotic disorders such as schizophrenia occur more frequently in alcoholics than in nonalcoholics40,41, and alcohol consumption can lead to tauopathies, diseases involving aggregation of abnormal tau protein in the brain42, such as Alzheimer’s dementia43. Moreover, environmental factors such as socioeconomic status or education quality play major roles in the development of alcoholism, a disease of environmental origin or environmentally induced disease44, and there exists a high comorbidity between drug abuse problems and alcoholism45.
The gene inference procedure (Section 4.4) identified many promising genes as significant. For example, a number of genes had already been found by other studies to be associated with the context of alcoholism, including PTGDS (P < 10^−15), the gene with the lowest p-value; MIF (P < 10^−13); BRCA1 (P < 10^−13); IL4 (P < 10^−7); and the three types of peroxisome proliferator-activated receptor genes (PPARs), PPARA (P < 10^−15), PPARD (P < 10^−15), and PPARG (P < 10^−5). PTGDS codes for prostaglandin D2 synthase, which is negatively correlated with alcohol intake48. In liver tissues affected by alcoholic liver disease, serum levels of macrophage migration inhibitory factor, coded by MIF, are elevated49, while alcohol inhibits IL4, which controls B-cell proliferation and immunoglobulin class switching50,51. Alcohol consumption is associated with heightened incidence of breast cancer, and ethanol down-regulates BRCA1, the second most likely gene, mutations of which are closely linked to breast cancer52,53. Both PPARA and PPARD are downregulated by ethanol, PPARD agonists alleviate alcohol-induced liver damage, and PPARG activation may suppress addictive drinking behaviors54. Other significant genes, such as the transcription factor gene TCF7 (P = 6.027 × 10^−10), have not yet been linked in a molecular biological study to ethanol; however, they have been found in other bioinformatics studies to be significantly associated with ethanol or with alcohol withdrawal55 and therefore are encouraging targets for future biological studies of alcohol dependence.
When the same procedure was conducted with the context “obesity,” the algorithm just as successfully identified diseases relevant to the context concept (Table 3). Obesity, unspecified is a synonym of the concept itself; morbid obesity, defined as weighing 45 kg or more above the ideal weight or having a BMI of at least 40, is a subset of the context56. Polyphagia, an eating disorder characterized by excessive consumption of food, can cause weight gain and lead to obesity. Alcohol intake is another potential cause of obesity57, and both alcohol and obesity are associated with fatty liver disease58 (alcoholic liver damage, alcohol induced liver disorder). Obesity increases the risk of cholelithiasis, the development of gallstones, especially during the weight loss process59 and is highly associated with polycystic ovary disease (ovarian dysfunction, ovarian non-neoplastic disease), with around 30% of individuals with polycystic ovary disease being obese60. There exists a genetic disorder, Ayazi syndrome, characterized by obesity, choroideremia, and congenital deafness61.
As with the context “alcoholism,” the proposed method identified as significant a promising mix of already-corroborated and potentially related genes for the context “obesity.” For instance, TGFB1 (P < 10^−7), a tissue growth factor that regulates proliferation, migration, and differentiation of diverse cells, is linked to abdominal obesity and insulin and glucose imbalance62. PPARD (P < 10^−7) activates other genes that direct fatty acid catabolism and thermogenesis; underexpression of PPARD results in obesity63, while PPARD agonists mimic exercise and make promising targets for treatment of metabolic syndromes63–65. UBB (P < 10^−6) is one of several genes that code for ubiquitin, a protein-recycling regulator that is involved in lipid metabolism and whose levels are inversely associated with BMI66. Indeed, mice lacking the UBB gene exhibit adult-onset obesity67. CARTPT (P < 10^−6) encodes hypothalamic satiety factors66, the dysregulation of which may lead to overeating, and FADS1 (P < 10^−10), which codes for fatty acid desaturase, is related to lipid metabolism and the plasma triacylglycerol response69; both genes might plausibly relate to obesity. One interesting find was YWHAH (P < 10^−9). Polymorphisms of YWHAH are associated with schizophrenia, and antipsychotic drugs70, including schizophrenia medications, are known to induce obesity71. Perhaps YWHAH is a missing link in knowledge that this method has identified.
Our technique can be seen as the first “automatic” probabilistic inference algorithm that uses large biomedical ontologies in conjunction with the vast corpus of existing biomedical literature and experimental data to address specific queries. Of the many probabilistic Bayesian frameworks proposed so far, only this one uses context-specific formulae to map concepts and to calculate conditional probabilities and specializes in a context-specific Bayesian network structure for inference. Therefore, inference using this technique is more closely customized to the researcher’s interests than inference using previous methods.
In this particular work, the proposed framework was effectively used to identify disease concepts and genes related to a context pathology of interest using existing knowledge embedded in literature and in ontologies. Identified disease concepts were invariably closely related to the context. Many of the genes the method identified likewise were known to be associated with the context pathology. Of the remaining genes, many had functions that could logically link them to the context concept or had been identified by other bioinformatics studies as differentially expressed in individuals exhibiting the context pathology. Such genes are promising and interesting because they may constitute new links that augment existing knowledge. The disease concepts and genes identified here may seem to be new information for a researcher with a specific query but no prior information.
All inferences are drawn from the annotated knowledge base that the framework uses as data. Therefore, it is critical that the annotation methods and the selection of data and literature included are comprehensive and representative. We must assume that our knowledge base satisfies the previous condition. Nonetheless, this work advances our ability to generate inferences from such bases. Because literature, experimental data, and ontologies continually evolve, the database must be up-to-date to comprehensively use the prior biomedical knowledge available. We intend to fully automate the data-preparation process in the future and integrate it with the inference framework presented here.
The proposed algorithm can be enhanced to improve its utility as an inferential tool. In this work, a query consists of a context concept of interest and two ontologies of terms from which connections are drawn. In future studies, we intend to extend the algorithm to be able to handle more complex queries. Additionally, future work can examine the predictive power of the framework in identifying drug-disease, drug-pathway, and pathway-disease relationships.
This work was supported by NIH grants 5R21DA025168-02 (G. Alterovitz), 1R01HG004836-01 (G. Alterovitz), and 4R00LM009826-03 (G. Alterovitz).