More than 700,000 biomedical articles were published in 2009 and indexed in MEDLINE [1]. At this rate, an internist, like other medical specialists, would have to read at least 20 scientific papers every day to keep up to date with this overwhelming number of yearly citations [2]. To address this information explosion, search engines such as PubMed [3] or HubMed [4] supplement biomedical literature databases. These engines provide instant access to the biomedical literature. However, the large volume of available citations often leads to the retrieval of too many results. PubMed, for example, retrieves citations ordered by date, but this criterion is rarely the main attribute used to measure the relevance of a biomedical publication. In practice, users need to perform further filtering and query modifications to retrieve the results that best suit their personal needs, which might not be listed on the first page of results.
In the early years of the pioneering information retrieval systems, the number of records was much smaller, and exact word matching was usually enough to find the specific document that a user was searching for [5]. Nowadays, professionals require more advanced document indexing architectures and natural language processing (NLP) techniques to handle the current information growth on the Web. To address this new situation, traditional information retrieval systems have evolved into modern search engines that can work in large, highly dynamic environments, e.g., Google [6], Yahoo [7] or Bing [8].
The above search engines were initially based on simple keyword-based queries. However, such queries were not efficient enough given the increasing complexity of the structure and content of the current Web. Thus, user-related data, i.e., user contexts such as physical location or query and web history, had to be built into commercial search engines, since relevance measures are different for each user [9]. Regarding search engines for biomedical literature, recent efforts have focused on improving citation retrieval from MEDLINE. For instance, HubMed [4] highlights articles that contain the search terms within the title or abstract, Relemed [10] uses sentence-level co-occurrence as a relevance metric for multi-word queries, askMEDLINE [11] allows free-text, natural language queries, and GoPubMed [12] classifies citations using Gene Ontology terms.
In other areas, the inclusion of user context information has improved the performance of search engines [13]. For instance, IntelliZap [15] addresses textual contexts. The input to this search tool is a set of keywords plus a natural language text, and the output is a set of results that are also related to the textual context. Other information retrieval systems personalize their results according to user history [16], ontological knowledge [18] or user profiles [20], e.g., by ranking results differently depending on user profile clusters [21].
In clinical practice, a similar approach is worth examining, i.e., considering user context information for retrieval purposes. For clinical applications, we might use, for instance, the information available within hospital information systems (HIS) and electronic health records (EHRs) [22]. In this regard, projects such as MorphoSaurus [23] and others [24] have focused on concept-based and semantic approaches to improve medical information search and retrieval. InfoButtons [25] have been widely used to provide links from EHRs to online resources, MEDLINE among them [26]. However, there is still a lack of tools that perform complex queries to retrieve biomedical literature. The process of query building, which incrementally includes terms for advanced filtering, is usually performed manually. We developed CDAPubMed to address this issue, aiming to provide a tool that semi-automatically builds such complex queries.
The objective of CDAPubMed is to use the contents of EHRs to provide additional information for improving and personalizing biomedical literature searches. For instance, let us assume that a clinician is searching for biomedical citations on a given disease in a particular patient. The traditional process would be to access PubMed and search for relevant citations by typing the specific keywords that the clinician considers most relevant for this case. After an initial, usually generic, search, he or she needs to narrow the scope of the search by adding more specific terms related to the particular patient, until he or she finds the most relevant citations. This is usually a multi-step process and can be tedious and time-consuming. In contrast, we suggest that automated techniques can help this clinician to find and filter these results, by linking citations with specific terms from the EHR that can be used to refine the query and improve its precision.
Following this approach, CDAPubMed is intended to help clinicians, researchers or other users to retrieve publications focused on their patients. When a user enters an EHR and a disease, the tool suggests and selects keywords within the EHR to filter the results. To accomplish this objective, the tool has to implement two main tasks: (i) automatic analysis of the EHR to identify relevant terms for literature retrieval, and (ii) generation of search engine queries to retrieve publications related to the EHR. Here we report our experience in developing CDAPubMed (available at http://porter.dia.fi.upm.es/cdapubmed/), and results showing its potential for EHR-based retrieval of biomedical literature.
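To make task (ii) more concrete, the following minimal sketch shows one way a disease and a list of EHR-derived terms could be combined into a boolean PubMed query and an NCBI E-utilities ESearch URL. It is an illustration under stated assumptions, not a description of how CDAPubMed is implemented: the function names `build_query` and `build_esearch_url`, the choice of the `Title/Abstract` field tag, the use of `AND` to combine terms, and the example terms are all hypothetical.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_query(disease, ehr_terms, field="Title/Abstract"):
    """Combine a disease with EHR-derived terms into one boolean query.

    Assumption: every term is quoted, restricted to a single PubMed
    field tag, and joined with AND. Blank and duplicate terms
    (case-insensitive) are skipped.
    """
    def tag(term):
        return f'"{term}"[{field}]'

    disease = disease.strip()
    clauses = [tag(disease)]
    seen = {disease.lower()}
    for term in ehr_terms:
        cleaned = term.strip()
        key = cleaned.lower()
        if key and key not in seen:
            seen.add(key)
            clauses.append(tag(cleaned))
    return " AND ".join(clauses)


def build_esearch_url(query, retmax=20):
    """Return an ESearch URL for the query; no network request is made."""
    params = {"db": "pubmed", "term": query, "retmax": retmax}
    return f"{ESEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical terms, standing in for keywords selected from an EHR.
    query = build_query("breast cancer", ["tamoxifen", "postmenopausal"])
    print(query)
    print(build_esearch_url(query, retmax=5))
```

Each additional EHR term narrows the result set, which mirrors the manual multi-step refinement described above. Joining every term with `AND` is the simplest choice and can easily over-restrict a search, so a real tool would presumably let the user choose which suggested terms to keep.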