The amount of biomedical literature is vast and growing rapidly. It has become impossible for researchers to read all publications in their field of interest, which forces them to make a stringent selection of relevant articles to read. To keep abreast of the available knowledge, a wide range of initiatives has been deployed to mine the literature, from manual encoding of gene relations by the Gene Ontology Consortium [1], to automatic extraction of specific information such as transcript diversity [2], to the use of literature data for the prediction of disease genes [3] (see [5] for recent reviews). One of the emerging approaches is text-mining, which infers associations between biomedical entities by combining information from multiple papers. Text-mining approaches typically rely on occurrence and co-occurrence statistics of terms and have been successfully applied to a number of problems. The classic application is literature-based knowledge discovery, which attempts to link disjunct sets of literature in order to derive promising new hypotheses [7]. Swanson (see, for example, [12]) was a pioneer in this field and published several new hypotheses derived with the help of literature mining. His well-known first example was the hypothesis that Raynaud's disease could be treated with fish oil [13], which was later corroborated experimentally [14]. Another field to which text-mining has been successfully applied is the analysis of DNA microarray data [15]. Microarray experiments can identify hundreds of genes that are relevant to the studied phenomenon. The interpretation of such gene lists is challenging because, for a single gene, there can be hundreds or even thousands of articles pertaining to the gene's function. Text-mining can alleviate this complication by revealing the associations between the genes that are apparent from the literature. This was the focus of the earlier version of Anni [18].
Here we present Anni 2.0, a tool that provides an ontology-based interface to the literature. The tool is aimed at a broad audience of biomedical researchers and facilitates traversing the huge corpus of biomedical literature efficiently to answer a broad range of information needs, including those for the interpretation of high-throughput datasets. Anni's functionality is based on the use of an ontology, which defines concepts, such as genes, biological processes and diseases, and their relations. Concepts come with a definition, a semantic type, and a list of synonymous terms, and can be linked to online databases. We identify references to concepts in texts with our concept recognition software Peregrine [19]. The idea behind Anni is to associate concepts with each other based on their associated sets of texts. Texts can be linked to a concept through automatic concept recognition, but also by using manually curated annotation databases. The texts associated with a concept are characterized by a so-called concept profile [18] (see the figure for an introduction to the technology behind Anni). A concept profile consists of a list of related concepts, and each concept in the profile has a weight that signifies its importance. Concept profiles have been successfully used to infer functional associations between genes [18] and between genes and Gene Ontology (GO) codes [21], to infer novel genes associated with the nucleolus [22], and to identify new uses for drugs and other substances in the treatment of diseases [8].
The technology behind Anni at a glance. Yellow balls indicate ontology concepts.
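To make the notion of a concept profile concrete, the following is a minimal sketch in which a profile is a mapping from concept to weight and two profiles are compared with a vector similarity. It is an illustration only, not Anni's implementation: the concept names, the weights, the function names, and the choice of cosine similarity as the matching measure are all assumptions made for this example.

```python
import math


def cosine_similarity(profile_a, profile_b):
    """Cosine similarity between two sparse concept profiles.

    Each profile is a dict mapping a concept name to a weight.
    The similarity measure is an assumption chosen for illustration.
    """
    shared = set(profile_a) & set(profile_b)
    dot = sum(profile_a[c] * profile_b[c] for c in shared)
    norm_a = math.sqrt(sum(w * w for w in profile_a.values()))
    norm_b = math.sqrt(sum(w * w for w in profile_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty profile matches nothing
    return dot / (norm_a * norm_b)


def rank_matches(query_profile, candidate_profiles):
    """Rank candidate concepts by profile similarity to the query, best first."""
    scored = [
        (name, cosine_similarity(query_profile, profile))
        for name, profile in candidate_profiles.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical toy profiles; names and weights are invented for illustration.
gene_x = {"process_A": 0.8, "process_B": 0.3, "process_C": 0.5}
candidates = {
    "gene_y": {"process_A": 0.7, "process_C": 0.6},
    "gene_z": {"process_D": 0.9, "process_B": 0.1},
}
ranking = rank_matches(gene_x, candidates)
```

In this toy example `gene_y` ranks above `gene_z` because it shares its two highest-weighted concepts with the query profile, even if the two genes never co-occur in a single document.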
Anni 2.0 provides a generic framework to explore concept profiles and facilitates a broad range of tasks, including literature-based knowledge discovery. The tool provides concepts and concept profiles covering the full scope of the Unified Medical Language System (UMLS) [23], a biomedical ontology. The user is given extensive control to query for direct associations (based on co-occurrences), to match concept profiles, and to explore the results in several ways, for instance with hierarchical clustering. Several types of ontological relations can be used in Anni. Semantic type information, which indicates whether a concept is, for example, a gene or a drug, can be used to group concepts. This allows, for instance, a query as to whether a gene of interest has an association with any of the available diseases. Hierarchical 'parent/child' relations are also available and can be visualized. They can be used to explore the relations in a group of concepts or to expand a query by identifying relevant related concepts in the hierarchy. An important feature of Anni is transparency: all associations can be traced back to the supporting documents. In this way, Anni can also be used to retrieve documents about concepts of interest, thereby exploiting the mapping of synonyms and the resolution of ambiguous terms by our concept recognition software.
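The combination of direct (co-occurrence) associations, semantic type filtering, and traceability to supporting documents can be sketched as follows. This is a simplified illustration under stated assumptions, not a description of Anni's internals: the document identifiers, concept names, semantic type labels, and function names are all hypothetical, and concept recognition is assumed to have already produced a list of concepts per document.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence_index(doc_concepts):
    """Map each unordered concept pair to the set of documents mentioning both.

    doc_concepts: dict mapping a document ID to the concepts recognized in it.
    Keeping the document IDs makes every association traceable to its sources.
    """
    index = defaultdict(set)
    for doc_id, concepts in doc_concepts.items():
        # Sorting gives each pair a canonical order, so (a, b) == (b, a).
        for a, b in combinations(sorted(set(concepts)), 2):
            index[(a, b)].add(doc_id)
    return index


def direct_associations(concept, semantic_type, index, semantic_types):
    """Return {partner: supporting document IDs} for co-occurring partners
    of `concept` that have the requested semantic type."""
    results = {}
    for (a, b), docs in index.items():
        if concept not in (a, b):
            continue
        partner = b if a == concept else a
        if semantic_types.get(partner) == semantic_type:
            results[partner] = sorted(docs)
    return results


# Hypothetical toy data; all identifiers are invented for illustration.
doc_concepts = {
    "doc1": ["gene_A", "disease_B", "process_C"],
    "doc2": ["gene_A", "disease_B"],
    "doc3": ["gene_A", "drug_D"],
}
semantic_types = {
    "gene_A": "gene",
    "disease_B": "disease",
    "process_C": "biological process",
    "drug_D": "drug",
}

index = build_cooccurrence_index(doc_concepts)
diseases = direct_associations("gene_A", "disease", index, semantic_types)
```

Here the query "is `gene_A` associated with any disease?" returns `disease_B` together with the two documents that support the association, mirroring the transparency described above.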
Previously, we illustrated the utility of concept profiles to retrieve functional and relevant associations between various types of concepts [18]. Here, we evaluate our tool through two use cases. First, we use Anni to analyze a DNA microarray dataset. Second, we attempt to reproduce and expand a published literature-based knowledge discovery.