Human genome sequencing marked the beginning of the era of large-scale genomics and proteomics, producing large quantities of information on sequences, genes, interactions, and their annotations. Just as the capability to analyze data has increased, the output of high-throughput techniques has generated ever more information for testing existing hypotheses and stimulating novel ones. Many experimental findings are reported in the -omics literature, where researchers have access to more than 20 million publications, with up to 4,500 new ones per day, available through the widely used PubMed citation index and Google Scholar. This vast increase in available information demands novel strategies to help researchers keep up to date with recent developments, as ad hoc Boolean querying is tedious and often misses important information.
Even though PubMed provides an advanced keyword search and offers useful query expansion, it returns hundreds or thousands of articles as a result; these are sorted by publication date, without providing much help in selecting or drilling down to those few articles that are most relevant to the user's actual question. As an example of both the amount of available information and the insufficiency of naïve keyword search, the name of the protein p53 occurs in 53,528 PubMed articles, and while a researcher interested specifically in its role in cancer and its interacting partners might try the search “p53 cancer interaction” to narrow down the results, this query still yields 1,777 publications, enough for months of full-time reading [1]. Nonetheless, PubMed is a very widely used free service that provides invaluable support to researchers around the world. In March 2007, PubMed served 82 million queries (statistics of Medline searches: http://www.nlm.nih.gov/bsd/medline_growth.html), and its usage is ever increasing. A few commercial products are currently available that provide additional services, but they also rely on basic keyword search, with no real discovery or dynamic faceted search. Examples are OvidSP and Ingenuity Answers, both of which support bookmarking as one means of keeping track of visited citations. Research tools such as EBIMed (EBIMed: http://www.ebi.ac.uk/Rebholz-srv/ebimed/index.jsp) and AliBaba (AliBaba: http://alibaba.informatik.hu-berlin.de) provide additional cross-referencing of entities to databases such as UniProt or to the Gene Ontology. They also try to identify relations between entities, such as protein-protein interactions, functional protein annotations, or gene-disease associations.
Search tools should provide dedicated and intuitive strategies that help users find relevant literature, starting with initial keyword searches and drilling down into results via overviews enriched with automatically generated suggestions for refining queries. One of the first steps in biomedical text mining is to recognize named entities occurring in a text, such as genes and diseases. Named entity recognition (NER) helps to identify relevant documents, index a document collection, and facilitate information retrieval (IR) and semantic searches [4]. A step on top of NER is to normalize each entity to a base form (also called grounding or identification); the base form is often an identifier from an existing, relevant database; for instance, protein names can be mapped to UniProt IDs [5]. Entity normalization (EN) is required to resolve ambiguities such as homonyms and to map synonyms to one and the same concept. This further alleviates the tasks of indexing, IR, and search. Once named entities have been identified, systems aim to extract relationships between them from textual evidence; in the biomedical domain, these include gene-disease associations and protein-protein interactions. Such relations can then be made available for subsequent search in relational databases or used for constructing particular pathways and entire networks [7].
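The NER and EN steps described above can be sketched as a simple dictionary lookup; the synonym table and the UniProt-style identifiers below are purely illustrative, not real database content:

```python
import re

# Map every known surface form (synonyms included) to one canonical ID.
# The entries are invented examples in the style of UniProt accessions.
SYNONYMS = {
    "p53": "P04637",
    "tp53": "P04637",
    "brca1": "P38398",
}

def recognize_and_normalize(text):
    """Token-level dictionary NER: return (surface form, canonical ID) pairs."""
    tokens = re.findall(r"\w+", text.lower())
    return [(tok, SYNONYMS[tok]) for tok in tokens if tok in SYNONYMS]
```

Because both "p53" and "TP53" map to the same identifier, downstream indexing and retrieval treat them as one concept, which is precisely the ambiguity-removal that entity normalization provides.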
Information extraction (IE) [8] is the extraction of salient facts about prespecified types of events, entities [12], or relationships from free text. Information extraction from free text utilizes shallow-parsing techniques [13], part-of-speech tagging [14], noun and verb phrase chunking [15], predicate-subject and object relationships [13], and learned [8] or hand-built patterns [18] to automate the creation of specialized databases. Manual pattern engineering approaches employ shallow parsing with patterns to extract the interactions. In the system presented in [19], sentences are first tagged using a dictionary-based protein name identifier and then processed by a module which extracts interactions directly from complex and compound sentences using regular expressions based on part-of-speech tags. IE systems look for entities, relationships among those entities, or other specific facts within text documents. The success of information extraction depends on the performance of the various subtasks involved.
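A regular-expression approach over part-of-speech tags, in the spirit of the system described above, can be sketched as follows; the coarse tag set (PROT for dictionary-matched protein names) and the interaction-verb list are simplified assumptions, not the actual system:

```python
import re

# Verbs taken to signal an interaction; an illustrative subset only.
INTERACTION_VERBS = {"binds", "interacts", "phosphorylates", "activates"}

def extract_interaction(tagged_tokens):
    """Match the tag pattern PROT - verb - optional preposition - PROT.

    tagged_tokens is a list of (word, tag) pairs, e.g. produced by a
    dictionary-based protein tagger plus a POS tagger.
    """
    tag_string = " ".join(tag for _, tag in tagged_tokens)
    if re.fullmatch(r"PROT VB( TO| IN)? PROT", tag_string):
        subject = tagged_tokens[0][0]
        verb = tagged_tokens[1][0]
        obj = tagged_tokens[-1][0]
        if verb in INTERACTION_VERBS:
            return (subject, verb, obj)
    return None

triple = extract_interaction(
    [("RAD51", "PROT"), ("binds", "VB"), ("to", "TO"), ("BRCA2", "PROT")]
)
```

Real systems use far richer patterns and handle compound sentences, but the principle, serializing tags so that an ordinary regular expression can match the syntactic shape of an interaction, is the same.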
The Suiseki system of Blaschke et al. [20] also uses regular expressions, with probabilities that reflect the experimental accuracy of each pattern, to extract interactions into predefined frame structures. Genies [21] utilizes a grammar-based natural language processing (NLP) engine for information extraction. Recently, it has been extended as GeneWays [22], which also provides a Web interface that allows users to search and submit papers of interest for analysis. The BioRAT system [23] uses manually engineered templates that combine lexical and semantic information to identify protein interactions. The GeneScene system [24] extracts interactions using frequent preposition-based templates.
Over the last years, a focus has been on the extraction of protein-protein interactions in general, recently including extraction from full-text articles, relevance ranking of extracted information, and other related aspects (see, for instance, the BioCreative community challenge [25]). The BioNLP'09 Shared Task concentrated on the recognition of more fine-grained molecular events involving proteins and genes [26]. Both papers give overviews of the specific tasks and reference articles by participants.
One of the first efforts to extract information on bio-molecular events was proposed by Yakushiji et al. [27]. They implemented an argument structure extractor based on full sentence parses; each verb in a list of target verbs has specific argument structures assigned to it. Frame-based extraction then searches for a filler for each slot required by the particular argument structure. On a small in-house corpus, they found that 75% of the errors could be attributed to erroneous parsing and another 7% to insufficient memory; both causes might have less impact on recent systems due to more accurate parsers and larger memory.
Ding et al. [28] studied the extraction of protein-protein interactions using the Link Grammar parser. After some manual sentence simplification to increase parsing efficiency, their system assumed an interaction whenever two proteins were connected via a link path; an adjustable threshold allowed overly long paths to be cut off. As they used the original version of Link Grammar, Ding et al. [28] argued that adaptations to the biomedical domain would enhance the performance.
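The link-path criterion can be illustrated with a breadth-first search over a toy link graph; the graph below is hand-made for the sentence "p53 strongly inhibits MDM2 expression" and does not reproduce actual Link Grammar output:

```python
from collections import deque

# Hand-made, symmetric link graph: each word lists the words it is linked to.
LINKS = {
    "p53": ["inhibits"],
    "inhibits": ["p53", "strongly", "MDM2"],
    "strongly": ["inhibits"],
    "MDM2": ["inhibits", "expression"],
    "expression": ["MDM2"],
}

def link_path_length(graph, start, goal):
    """BFS over the link graph; return the number of links on the shortest path."""
    queue, seen = deque([(start, 0)]), {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # no link path at all

def proteins_interact(graph, a, b, max_links=3):
    """Assume an interaction when two proteins are linked within the threshold."""
    dist = link_path_length(graph, a, b)
    return dist is not None and dist <= max_links
```

Lowering `max_links` makes the heuristic more conservative: p53 and MDM2 are two links apart here, so they count as interacting at the default threshold but not at a threshold of one.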
An information extraction application analyzes texts and presents only the specific information from them that the user is interested in [29]. IE systems are knowledge-intensive to build and are, to varying degrees, tied to particular domains and scenarios such as a target schema. Almost all IE applications start with a fixed target schema as a goal and are tuned to extract information from unstructured text that fits the schema. In scenarios where the target schema is unknown, open information extraction systems [30] like KnowItNow [31] and TextRunner [32] allow rules to be defined easily based on the extraction need. A hybrid application (IR + IE) that leverages the best of information retrieval (the ability to retrieve relevant texts) and information extraction (analyzing text and presenting only the specific information the user is interested in) would be ideal in cases where the target extraction schema is unknown. An iterative loop of IR and IE with user feedback would be potentially useful. For such an application, the main components of an IE system (such as part-of-speech taggers, named entity taggers, and shallow parsers) preprocess the text before it is indexed by a custom-built augmented index that supports queries of the type “Cities such as ProperNoun(Head(Noun Phrase)).” Cafarella and Etzioni [33] have done work in this direction to build a search engine for natural language and information extraction applications.
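A minimal sketch of such an augmented index might look as follows; the hand-tagged toy corpus stands in for real tagger output, and the positional-join query is an illustration of the idea, not the actual system of Cafarella and Etzioni:

```python
from collections import defaultdict

# Illustrative corpus of (word, POS tag) pairs; NNP marks proper nouns.
TAGGED_DOCS = {
    1: [("cities", "NNS"), ("such", "JJ"), ("as", "IN"), ("Berlin", "NNP")],
    2: [("cities", "NNS"), ("such", "JJ"), ("as", "IN"), ("Paris", "NNP")],
    3: [("rivers", "NNS"), ("such", "JJ"), ("as", "IN"), ("Danube", "NNP")],
}

def build_augmented_index(docs):
    """Index surface tokens and POS tags alike, recording (doc, position)."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for pos, (word, tag) in enumerate(tokens):
            index[word.lower()].add((doc_id, pos))
            index[tag].add((doc_id, pos))
    return index

def such_as_instances(index, docs, category):
    """Answer queries like 'cities such as <ProperNoun>' via positional joins."""
    hits = []
    for doc_id, pos in index.get(category, set()):
        if ((doc_id, pos + 1) in index["such"]
                and (doc_id, pos + 2) in index["as"]
                and (doc_id, pos + 3) in index["NNP"]):
            hits.append(docs[doc_id][pos + 3][0])
    return sorted(hits)
```

Because POS tags are indexed alongside words, the query can constrain a position to be a proper noun without knowing which word appears there, which is what distinguishes this index from a plain keyword index.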
Exploratory search [34] is a topic that has grown out of the fields of information retrieval and information seeking but has become more concerned with alternatives to the kind of search that has received the majority of focus (returning the most relevant documents for a Google-like keyword search). The research is motivated by questions like “what if the user does not know which keywords to use?” or “what if the user is not looking for a single answer?”. Consequently, research began to focus on defining the broader set of information behaviors in order to learn about situations when a user is, or feels, limited by having only the ability to perform a keyword search (source: http://en.wikipedia.org/wiki/Exploratory_search). Exploratory search can be defined as a specialization of information exploration, representing the activities carried out by searchers who are either [35]:
- unfamiliar with the domain of their goal (i.e., need to learn about the topic in order to understand how to achieve their goal);
- unsure about the ways to achieve their goals (either the technology or the process); or even
- unsure about their goals in the first place.
A faceted search system (or parametric search system) presents users with key-value metadata that is used for query refinement [36]. By using facets (which are metadata or class labels for entities such as genes or diseases), users can easily combine the hierarchies in various ways to refine and drill down into the results for a given query; they do not have to learn a custom query syntax or restart their search from scratch after each refinement. Studies have shown that users prefer faceted search interfaces because of their intuitiveness and ease of use [37]. Hearst [38] shares her experience, best practices, and design guidelines for faceted search interfaces, focusing on supporting flexible navigation, seamless integration with directed search, fluid alternation between refining and expanding, avoidance of empty result sets, and, most importantly, putting users at ease by retaining a feeling of control and understanding of the entire search and navigation process. To improve web search for queries containing named entities, one approach [39] is to automatically identify the subject classes to which a named entity might refer and to select a set of appropriate facets for denoting the query.
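The drill-down behavior can be sketched over a toy document set with gene and disease facets; the metadata values below are invented for illustration:

```python
from collections import Counter

# Toy document collection; each document carries facet metadata.
DOCS = [
    {"id": 1, "gene": "p53",   "disease": "breast cancer"},
    {"id": 2, "gene": "p53",   "disease": "lung cancer"},
    {"id": 3, "gene": "BRCA1", "disease": "breast cancer"},
]

def facet_counts(docs, facet):
    """Count how many documents carry each value of the given facet."""
    return Counter(doc[facet] for doc in docs)

def drill_down(docs, **selected):
    """Keep only documents matching every selected facet value."""
    return [d for d in docs if all(d.get(f) == v for f, v in selected.items())]

# Selecting gene=p53 narrows the result set; recomputing the disease
# facet counts on the subset shows the impact of that choice.
subset = drill_down(DOCS, gene="p53")
remaining_diseases = facet_counts(subset, "disease")
```

Each refinement is just a further filter over the current subset, which is why users never need to restate the query or learn a query syntax: the interface recomputes the facet counts after every click.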
Faceted search interfaces have made online shopping experiences richer and increased the accessibility of products by allowing users to search with general keywords and then browse and refine the results until the desired subset is obtained (SIGIR 2006 Workshop on Faceted Search (CFP): http://sites.google.com/site/facetedsearch/). Faceted navigation delivers an experience of progressive query refinement or elaboration. Furthermore, it allows users to see the impact of each incremental choice in one facet on the choices available in other facets. Faceted search combines faceted navigation with text search, allowing users to access (semi)structured content, thereby providing support for discovery and exploratory search, areas where conventional search falls short [40].