In addition to the manual annotation tool with which biologists create annotated links between models and full papers, the PathText Integrator includes a method for retrieving new papers relevant to specific parts of a model through TM systems. Parts of a model can be linked to such an implicit set of papers only in the form of queries, with which each individual TM system retrieves a set of texts. Because the results returned by each TM system carry their own semantic annotations, the Integrator needs to interpret the annotations in the retrieved text to identify and visualize the portions relevant to the model.
We integrate three TM systems (MEDIE, KLEIO and FACTA) in PathText, each with its own characteristics, strengths and weaknesses. The crucial considerations are how to maximize the effectiveness of the system by exploiting the characteristics of these TM systems, and how to reduce the burden on the user when interacting with them.
A pathway represents a specific biological context in which species (genes, proteins, enzymes, etc.) and relations among them (phosphorylation, binding, degradation, etc.) occur. A user who clicks a specific node in a pathway is most likely interested not in the gene, protein or enzyme that the node represents in general, but in that gene, protein or enzyme in the specific context. Such rich contextual information should be used in query formation to improve precision. MEDIE, whose query language is highly expressive, provides the means to embed such contextual information in a query. The Integrator in PathText stores the associations between nodes/relations and complex query formulas. The complex queries and their associations are established in advance, during the model curation phase. Although such fixed queries embedded in a pathway model are effective, a biologist also wants to navigate a set of documents more freely. For this purpose, the general query format that MEDIE provides is expressive but too complex. Interactive query formation as provided by KLEIO, with rich semantic annotations and facet search based on them, is ideal for such free navigation of document sets.
Another functionality, which we have found extremely useful, allows a user to formulate a query on the fly through the pathway visualization. That is, the user clicks an arbitrary number of nodes in a pathway, and the system returns the set of documents in which the species those nodes represent co-occur, together with distribution statistics of the other species in the pathway. FACTA provides this functionality with its specialized indexing.
MEDIE (Miyao et al.) is an intelligent search engine for retrieving biomedical relational information from a large textbase created from MEDLINE. Figure 4 shows MEDIE text mining search results in PathText. The textbase stores all MEDLINE abstracts, with annotations represented by XML-like text markup for both the article metadata provided by NLM (MeSH terms, publication date, etc.) and the analysis results of various NLP modules. The annotated text is indexed for efficient structure search by extended region algebra (ERA) (Masuda and Tsujii, 2008). A high-precision query can be formulated by using the NLP analysis results and the region algebra.
Fig. 4. MEDIE text mining search results shown in the PathText GUI. A sentence that contains a biological event is shown together with the whole abstract in which the sentence appears. The biological event corresponds to a link in a pathway that the user clicks.
The NLP modules used for the annotation include (but are not limited to) a deep syntactic analyzer, an event expression recognizer (EER) and a term recognizer. The syntactic analyzer, the Enju parser (Miyao et al.), produces a syntactic and semantic analysis of the text, based on a linguistic formalism called HPSG (Pollard and Sag, 1994). A relational concept, such as ‘protein A activates protein B’, can be precisely described as a query that specifies the semantic structure given by the Enju parser as a constraint (see Figs 5 and 6). This is the main strength of MEDIE compared with other publicly available TM modules, which use Boolean formulas of keywords or concepts for query formulation. Boolean formulas can only specify the co-occurrence of concepts or words as a retrieval constraint. With them, one can only require that protein A, protein B and the verb ‘to activate’ co-occur in the same textual unit (usually an abstract), which results in a large number of false positives.
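The difference between the two kinds of constraint can be illustrated with a minimal Python sketch. It is not MEDIE's implementation: the `Predicate` triples are written by hand in place of real Enju output, and all names are hypothetical. It only shows why a co-occurrence constraint accepts a sentence with the roles reversed, whereas a constraint on the predicate-argument structure does not.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Predicate:
    """One predicate-argument triple (hypothetical stand-in for parser output)."""
    verb: str  # lemma of the predicate
    arg1: str  # semantic subject
    arg2: str  # semantic object


@dataclass
class Sentence:
    text: str
    predicates: list  # hand-written Predicate triples, not real parser output


def cooccurrence_match(sentence, entity_a, entity_b, verb_stem):
    """Boolean-style constraint: both entities and some form of the verb occur."""
    tokens = sentence.text.lower().rstrip(".").split()
    return (
        entity_a.lower() in tokens
        and entity_b.lower() in tokens
        and any(t.startswith(verb_stem) for t in tokens)
    )


def structure_match(sentence, verb, arg1, arg2):
    """Structural constraint: the exact predicate-argument triple must be present."""
    return Predicate(verb, arg1, arg2) in sentence.predicates


SENTENCES = [
    Sentence("A activates B.", [Predicate("activate", "A", "B")]),
    Sentence("B is activated by A.", [Predicate("activate", "A", "B")]),
    Sentence("B activates A.", [Predicate("activate", "B", "A")]),  # reversed roles
]

for s in SENTENCES:
    print(s.text,
          cooccurrence_match(s, "A", "B", "activat"),
          structure_match(s, "activate", "A", "B"))
```

All three sentences satisfy the co-occurrence constraint, but only the first two satisfy the structural one; the third is the kind of false positive described above.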
Fig. 5. A semantic relation expressed in different textual expressions: these two tree structures represent two sentences with different voices (active and passive), which essentially describe the same event. The identity of the event is captured by the predicate-argument ...
Fig. 6. An example of an ERA query: >> is an operator. [R1] >> [R2] means that a text span tagged by R1 should contain another text span tagged by R2. $sbj and $obj are variables. They are used to express the dotted lines in ...
Units of retrieval in MEDIE are finer than those in other TM modules. They can be individual sentences in abstracts, or even phrases. Furthermore, the ERA allows us to specify constraints on context separately from constraints on units of retrieval. That is, we can formulate a query for retrieving sentences that contain a specific biological event (e.g. *Protein A* activates *protein B*) and that appear in abstracts in which certain keywords or other biological events are reported. This separation of units of retrieval and context is extremely useful for PathText to specify constraints embodied by the neighboring parts of a pathway network.
The semantic representation produced by Enju also works as an intermediate language that bridges the gap between a search query and the textual expression in the article. That is, a single semantic relation is represented in the same way at the semantic representation level across different textual expressions, and can hence be retrieved with a single query. See Fig. 5 for an example of a semantic relation, ‘protein A activates protein B’, expressed differently in surface textual form, in which the semantic subject (ARG1) and semantic object (ARG2) of the verb ‘activate’ are represented in the same way at the semantic representation level (indicated by the dashed arrows).
The EER and the term recognizer further enhance the search capability of MEDIE by introducing another level of semantic abstraction. They map surface textual expressions of biological events or technical terms to the corresponding concept identifiers defined in ontologies. The EER recognizes biological molecular events mentioned in text and maps them to identifiers of event types defined in terms of the Gene Ontology (GO) (Ashburner et al.). The current version of the EER distinguishes 35 event types in GO, which include binding, positive/negative regulation, etc. Using the annotations by Enju and the EER, we can retrieve sentences in which a biological event of ‘positive regulation of protein B by protein A’ is reported, even though it may be expressed in diverse surface forms such as ‘A activates B’, ‘B is induced by A’, and so on. The domain-specific lexical knowledge, such as the synonymy of ‘activate’ and ‘induce’ in the molecular biology domain, was collected from the GENIA Event corpus (Kim et al.).
The term recognizer detects gene, protein and disease names in the text, and assigns unique database IDs to differently expressed entity names (i.e. synonyms). The gene/protein IDs are taken from a gene/protein meta-DB, Gena (Koike and Takagi, 2004), and the disease IDs are taken from UMLS. By combining the annotations given by the term recognizer with those given by the parser and the EER, we can recognize a biological event even when an entity (protein or gene) involved in it is mentioned under different names.
The index for MEDIE is based on the ERA. In the ERA, we can specify a semantic relation encoded as topological relations (e.g. a text span includes another text span, a text span follows another text span, etc.) among textual spans and annotations. Structural relations can also be directly represented by linking variables. For example, to retrieve the sentences that mention a binding event between ‘protein A’ and ‘protein B’, we formulate a query that has three key phrases, ‘protein A’, ‘protein B’ and ‘bind’, among which the semantic relation ‘protein A binds to protein B’ holds. The query in the ERA is shown in Fig. 6.
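As a rough illustration of the containment relation on which such queries rest, the following Python sketch models annotations as tagged character spans and finds the sentence spans that contain all required inner spans. It is an assumption-laden toy, not the ERA engine: the `Region` type, the function names and the character offsets are invented for the example, and the variable linking used to express predicate-argument relations is not modelled.

```python
from typing import NamedTuple


class Region(NamedTuple):
    tag: str         # annotation name, e.g. "sentence", "protein", "event"
    start: int       # character offset, inclusive
    end: int         # character offset, exclusive
    value: str = ""  # normalized content, e.g. a protein name or event type


def contains(outer, inner):
    """True if the span of `outer` fully covers the span of `inner`."""
    return outer.start <= inner.start and inner.end <= outer.end


def containing(regions, outer_tag, inner_tag, inner_value=None):
    """Regions tagged `outer_tag` that contain a region tagged `inner_tag`.

    This mirrors the reading of [R1] >> [R2] given in the caption of Fig. 6.
    """
    inners = [r for r in regions
              if r.tag == inner_tag
              and (inner_value is None or r.value == inner_value)]
    return [o for o in regions
            if o.tag == outer_tag and any(contains(o, i) for i in inners)]


def sentences_with_all(regions, constraints):
    """Sentences that contain every (tag, value) pair in `constraints`."""
    result = [r for r in regions if r.tag == "sentence"]
    for tag, value in constraints:
        hits = containing(regions, "sentence", tag, value)
        result = [s for s in result if s in hits]
    return result


# Made-up offsets for: "Protein A binds to protein B. Protein C is expressed."
REGIONS = [
    Region("sentence", 0, 30),
    Region("sentence", 31, 55),
    Region("protein", 0, 9, "A"),
    Region("event", 10, 15, "bind"),
    Region("protein", 19, 28, "B"),
    Region("protein", 31, 40, "C"),
]

print(sentences_with_all(
    REGIONS, [("protein", "A"), ("protein", "B"), ("event", "bind")]))
```

Containment alone still only expresses co-occurrence within a sentence; in the query described in the text, it is the variables that tie the two proteins to the subject and object of ‘bind’.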
MEDIE accepts search queries through a WEB-API, in addition to an interactive search UI. The API takes a tuple of <subject, verb, object> as input, which describes a biological event/relation, such as <p53, activate, beta-4>, and returns a set of articles in which the event/relation is mentioned. The tuple is internally translated to an ERA query, using the same gene/protein dictionary and event expression dictionary as the above-mentioned NLP modules. For example, the tuple <p53, activate, beta-4> is translated to the region-algebra query shown in Fig. 7. The WEB-API thus hides from users the details of the backend database that are irrelevant to them, such as the annotation schemes used in the NLP modules or the dictionaries used in developing them. A specification on the metadata part, such as journal titles, can be expressed as additional fields alongside the subject-verb-object tuple.
Fig. 7. An ERA query for <subject, verb, object>=<p53, activate, beta-4>.
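The translation step can be sketched in Python as follows. This is a hedged illustration only: the dictionaries, the identifiers (`GENE:0001`, etc.), the event-type labels and the `translate` function are hypothetical, and the output is a plain dictionary rather than real ERA syntax, which is not reproduced here. The point is that the caller supplies surface strings while normalization to identifiers and event types happens behind the API.

```python
# Hypothetical stand-ins for the gene/protein and event expression dictionaries.
GENE_DICT = {
    "p53": "GENE:0001",
    "tp53": "GENE:0001",    # synonym mapped to the same made-up identifier
    "beta-4": "GENE:0002",
}
EVENT_DICT = {
    "activate": "positive_regulation",
    "induce": "positive_regulation",
    "bind": "binding",
}


def translate(subject, verb, obj, **metadata):
    """Turn a <subject, verb, object> tuple into a structured query (sketch)."""

    def entity(name):
        key = name.lower()
        if key in GENE_DICT:
            return {"id": GENE_DICT[key]}  # match any synonym of the entity
        return {"surface": name}           # unknown name: match the string itself

    query = {
        "unit": "sentence",                # unit of retrieval
        "event": EVENT_DICT.get(verb.lower(), verb.lower()),
        "arg1": entity(subject),           # semantic subject
        "arg2": entity(obj),               # semantic object
    }
    if metadata:
        query["context"] = dict(metadata)  # e.g. journal title constraints
    return query


print(translate("p53", "activate", "beta-4"))
print(translate("TP53", "induce", "beta-4", journal="Bioinformatics"))
```

In this sketch the two calls differ only in the added `context` field, because both the entity synonym and the verb synonym normalize to the same values.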
KLEIO (Nobata et al.) uses the results of named entity recognition to provide a range of semantic search functions. A standard indexing tool, Lucene, is used to generate an index over the terms for proteins, genes, metabolites and medical terms that have been recognized. This is an index of the concepts referred to in the text, rather than of individual or canonical word forms. It allows us to retrieve documents that refer to a specific concept even though the surface form used may differ in each case, as with orthographic variants or acronyms used instead of their expansions. In KLEIO, full forms of named entities, including variants, are linked to their acronyms via an acronym recognition and disambiguation process (Okazaki and Ananiadou, 2006). The system also offers document retrieval based on the unique identifier of a concept, providing a link back to the original databases from which the system's dictionary was generated. In addition, by further classifying terms into semantic categories, the system allows the user to pin down a specific concept by associating a semantic category with a query term. This can radically reduce the search space. For example, more than 60 000 documents were returned when the word ‘cat’ was given as a query, owing to its ambiguity. However, when the query was modified to specify the desired semantic category for ‘cat’, e.g. PROTEIN, a more focused result was returned: for the query ‘PROTEIN:cat’, 200 documents were returned. Moreover, the documents returned by the initial query are dynamically organized into semantic facets based on the named entities recognized in the query and those occurring in the same immediate context in each retrieved document. The user may thus refine the initial query by combining concepts from the offered facets, or may follow the links to the document representations. The documents themselves are presented with concept markup on all the recognized terms.
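The effect of category-qualified queries and facets can be sketched with a small in-memory index in Python. This is an assumption-based toy rather than KLEIO's Lucene setup: the three documents, their annotations and the function names are invented, and the concepts are plain strings where the real system uses normalized identifiers.

```python
from collections import Counter, defaultdict

# Invented documents, each reduced to its (category, concept) annotations.
DOCS = {
    1: [("PROTEIN", "cat"), ("DISEASE", "anemia")],
    2: [("SPECIES", "cat"), ("DISEASE", "allergy")],
    3: [("PROTEIN", "cat"), ("PROTEIN", "p53")],
}


def build_index(docs):
    """Index each concept both with its category and without (key None)."""
    index = defaultdict(set)
    for doc_id, annotations in docs.items():
        for category, concept in annotations:
            index[(category, concept)].add(doc_id)
            index[(None, concept)].add(doc_id)
    return index


def search(index, query):
    """Accept either 'concept' or 'CATEGORY:concept'."""
    category, sep, concept = query.partition(":")
    if not sep:  # no category given: search across all categories
        category, concept = None, query
    return index.get((category, concept), set())


def facets(docs, doc_ids):
    """Count, per semantic category, the concepts found in the hit documents."""
    counts = defaultdict(Counter)
    for doc_id in doc_ids:
        for category, concept in docs[doc_id]:
            counts[category][concept] += 1
    return counts


index = build_index(DOCS)
print(search(index, "cat"))          # ambiguous query: every sense of 'cat'
print(search(index, "PROTEIN:cat"))  # restricted to the protein sense
print(dict(facets(DOCS, search(index, "PROTEIN:cat"))))
```

The facet counts returned for a result set are what a user would combine with the initial query to narrow it further.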
As with MEDIE, KLEIO stores the whole set of abstracts from MEDLINE together with the metadata provided by the National Library of Medicine, and augments these data with rich semantic annotations. Semantic annotation in KLEIO covers far more semantic categories of named entities than that of MEDIE, although it does not include syntactic/semantic annotations of sentences. The normalized identifiers that KLEIO uses are, therefore, not only UniProt and UMLS identifiers, but also HMDB and DrugBank identifiers for small molecules and metabolites, which are crucial for integrating metabolic pathways. Acronyms, which are pervasive in biological papers, are also disambiguated (Okazaki et al.) and normalized into identifiers if the disambiguated results belong to the semantic categories that KLEIO is able to deal with. Because of its surface word indexing and richer semantic categories, KLEIO is used as a fall-back system when species in a model are not covered by MEDIE. KLEIO accepts PathText queries through a WEB-API. The API accepts space-separated terms as well as Boolean queries, for example ‘p65 AND beta4’. KLEIO then returns, in XML, a set of articles relevant to the query, containing PubMed IDs and abstracts in which the terms matching the query are highlighted.
FACTA (Tsuruoka et al.) is an information retrieval system whose usage differs considerably from that of MEDIE and KLEIO. It takes a large set of articles (the whole of MEDLINE in the current version of FACTA) and finds implicit associations between named entities, using statistical measures of the co-occurrence of entities in the same articles. It can find and show a biologist, for example, a list of genes that would be relevant to a given disease.
FACTA was originally designed as an interactive system to show the user such a list of entities on the fly, and special care was taken to compute the statistical measures very quickly. More specifically, it builds a data structure called an inverted index, which allows efficient access to the articles in which a particular set of entities appears. This data structure enables the system to compute co-occurrence statistics on the fly even if the input entities appear in a large number of articles. For example, FACTA can produce a ranked list of genes and proteins that co-occur with ‘p53’, which is mentioned in more than 46 000 articles, in 0.04 s. Combined with the PathText interface, FACTA allows the user to select an arbitrary subset of the species in the pathway and immediately find out which other species co-occur with them in the literature (Fig. 8; see Section 6.2).
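The idea of answering such queries from an inverted index can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not FACTA's actual index: the four articles and their entity sets are invented, the function names are hypothetical, and raw co-occurrence counts stand in for whatever statistical measures the real system computes.

```python
from collections import Counter, defaultdict

# Invented toy corpus: article ID -> entities recognized in that article.
ARTICLES = {
    "a1": {"p53", "MDM2", "ATM"},
    "a2": {"p53", "MDM2"},
    "a3": {"p53", "BAX"},
    "a4": {"MDM2", "AKT1"},
}


def build_inverted_index(articles):
    """Map each entity to the set of articles that mention it."""
    index = defaultdict(set)
    for article_id, entities in articles.items():
        for entity in entities:
            index[entity].add(article_id)
    return index


def cooccurring(index, articles, selected):
    """Articles mentioning all selected entities, plus counts of other entities."""
    postings = [index.get(entity, set()) for entity in selected]
    hits = set.intersection(*postings) if postings else set()
    counts = Counter()
    for article_id in hits:
        # Count every other entity once per article in which it co-occurs.
        counts.update(articles[article_id] - set(selected))
    return hits, counts


index = build_inverted_index(ARTICLES)
hits, counts = cooccurring(index, ARTICLES, ["p53", "MDM2"])
print(sorted(hits), counts.most_common())
```

Because the selected entities are intersected at query time, any combination of clicked nodes can be handled without precomputing statistics for every possible subset.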
Fig. 8. Payao GUI showing PathText manual search results discovered using FACTA. The two green colored nodes are the nodes clicked by a user. FACTA retrieves a set of documents in which these two species co-occur. The red icons show that these species also occur ...
It should be noted that such co-occurrence statistics cannot be computed off-line, because the user is allowed to specify any combination of the species as the input. This is why PathText needed to integrate the functionality of FACTA, which is tuned for real-time use with a special index structure. In contrast, retrieval of articles by MEDIE and KLEIO is performed in batch mode (i.e. off-line), and the results are attached to the relevant part of a model in the repository of the Integrator (see Section 6.1).
Unlike MEDIE or KLEIO, FACTA currently uses a simple longest-matching algorithm to recognize gene/protein names in the literature. The dictionary was created from BioThesaurus (Liu et al.), with some manual curation including the removal of noisy and highly ambiguous entries.
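A generic greedy longest-match tagger over a dictionary can be sketched in Python as below. The text does not give FACTA's exact procedure, so this is only one common reading of ‘longest matching’; the dictionary entries, the identifiers and the `max_len` cut-off are hypothetical.

```python
def longest_match(tokens, dictionary, max_len=5):
    """Greedy left-to-right dictionary lookup that prefers the longest entry.

    Returns (start, end, entry_id) triples over token positions, with `end`
    exclusive. `max_len` bounds the number of tokens tried per candidate.
    """
    matches = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate starting at position i first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n]).lower()
            if candidate in dictionary:
                matches.append((i, i + n, dictionary[candidate]))
                i += n  # continue after the matched name
                break
        else:
            i += 1      # no dictionary entry starts here
    return matches


# Hypothetical dictionary: lower-cased name -> made-up identifier.
DICTIONARY = {
    "tumor necrosis factor": "ID_TNF",
    "tumor necrosis factor receptor": "ID_TNFR",
    "p53": "ID_P53",
}

tokens = "the tumor necrosis factor receptor binds p53".split()
print(longest_match(tokens, DICTIONARY))
```

Preferring the longest entry makes ‘tumor necrosis factor receptor’ win over its prefix ‘tumor necrosis factor’. It does nothing to resolve ambiguous names, which is presumably why noisy and highly ambiguous entries were removed from the dictionary.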
FACTA runs on a generic Linux server with 2.2 GHz AMD Opteron processors and 16 GB memory, on which all the inverted indexes are stored.