Every year, over 400,000 new articles reportedly enter the biomedical literature [1]. This staggering growth of biomedical findings has created an unprecedented corpus of knowledge that is impossible to explore with traditional means of literature consultation and database searches. This information overload has motivated the development of structured information repositories that organize biomedical findings according to hierarchical ontologies.
Ontologies lie at the heart of two major complementary activities in biomedical research. Communities of researchers create and maintain ontologies to represent different types of entities and relations across the domains of biomedicine. Biomedical experimentalists, in turn, use these ontologies to annotate data in order to facilitate data integration and translational discoveries. This activity has been greatly intensified by the development of high-throughput experimental platforms such as gene expression microarrays [2], SNP microarrays [3], and next-generation sequencing platforms [4].
The rise of such ontological organization has created a new problem: the proliferation of disparate and seemingly unrelated biomedical ontologies. For example, the National Center for Biomedical Ontology's (NCBO) BioPortal [5] provides over 200 such ontologies to researchers. Scientists generally use these ontologies to annotate their data, but which ontologies to use, and how they relate to each other, is often unclear. What is needed is a principled integration of these conceptualizations, a “grand unification” of biological terms. It has been established [6] that the integration of these available ontologies will have a tremendous impact on the advancement of the biomedical sciences. Such integrated ontologies will provide a complete basis for biomedical knowledge representation and act as a foundation for inference on new biomedical data. Furthermore, a quantitative approach to integration would make the complex space of ontologies easier for researchers to navigate by pointing them to the numerous links among ontologies and ranking those links according to a principled metric, thus making the discovery process faster and more efficient.
To date, mapping and integrating ontologies in the biomedical domain has relied on discovering links between syntactically and semantically similar terms across ontologies [7]. Such an approach can relate terms with similar meanings but cannot deduce relationships between seemingly disparate functional spaces such as diseases, drugs, and anatomy. Approaches to ontology integration in the data-integration community use methods ranging from machine learning [8] to graph matching [9] to natural language processing [10]. These methods again inherently focus on mapping synonyms across ontologies. Recently, the Ontology Alignment Evaluation Initiative [11] was launched as a competition between alignment algorithms on a standardized dataset. These methods generally follow the traditional definition of ontology alignment based on synonyms. Even the instance-based mapping methods in these initiatives aim to converge two ontologies that represent the same knowledge base. For domains as disparate as biomedical ontologies, such methods do not work; moreover, the computational complexity of these algorithms makes them infeasible at the massive scale of such vocabularies. Other approaches infer these links through standard manual curation, which is again a tedious and labor-intensive task with poor scaling properties.
Here we propose a novel computational and methodological framework for context-specific integration of biomedical ontologies through free-text literature analysis. We model context specificity using another ontology and derive context-dependent functional links between ontological concepts that occur as phrases in free-text literature. We cache massive amounts of literature data to enable efficient counting of co-occurring ontology terms. Based on these statistics, we compute the penalized likelihoods of the dependence and independence models by applying the well-known Bayesian information criterion [12] to a context-sensitive model scoring function. We address scalability with a depth-first branch-and-bound heuristic that prunes subgraphs that do not yield significant links.
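As a minimal illustration of the scoring step only (a sketch under simplifying assumptions, not the framework's actual context-sensitive scoring function): given co-occurrence counts of two ontology terms across a corpus of documents, one can compare the BIC-penalized log-likelihood of a dependence model (the full joint over the 2x2 contingency table) against an independence model in which the joint probability factorizes into the marginals. The function name and the plain multinomial likelihood are hypothetical choices for this sketch.

```python
import math

def log_likelihood(counts, probs):
    """Multinomial log-likelihood of 2x2 cell counts under given cell probabilities."""
    return sum(n * math.log(p) for n, p in zip(counts, probs) if n > 0)

def bic_dependency_score(n11, n10, n01, n00):
    """Score a candidate link between terms A and B from document counts:
    n11 = docs mentioning both, n10 = A only, n01 = B only, n00 = neither.
    Returns BIC(dependence) - BIC(independence); positive values favor a link.
    BIC here is the penalized log-likelihood logL - (k/2) * ln(n)."""
    n = n11 + n10 + n01 + n00
    counts = [n11, n10, n01, n00]
    # Dependence model: the full joint, 3 free cell-probability parameters.
    joint = [c / n for c in counts]
    bic_dep = log_likelihood(counts, joint) - (3 / 2) * math.log(n)
    # Independence model: p(A,B) = p(A) p(B), 2 free parameters (the marginals).
    pa = (n11 + n10) / n
    pb = (n11 + n01) / n
    indep = [pa * pb, pa * (1 - pb), (1 - pa) * pb, (1 - pa) * (1 - pb)]
    bic_indep = log_likelihood(counts, indep) - (2 / 2) * math.log(n)
    return bic_dep - bic_indep
```

When the two terms co-occur far more often than the marginals predict, the dependence model's likelihood gain outweighs its extra-parameter penalty and the score is positive; when the observed table matches the independence prediction exactly, the two likelihoods coincide and the score reduces to the penalty difference, -0.5 * ln(n).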
We believe that such a methodological approach would turn machine-processable ontologies into a single landscape of integrated biomedical concepts and annotations, enabling researchers to bring the full power of established biomedical knowledge to bear on each individual finding.