A major problem faced in biomedical informatics involves how best to present information retrieval results. When a single query retrieves many results, simply showing them as a long list often provides a poor overview. With a goal of presenting users with reduced sets of relevant citations, this study developed an approach that retrieved and organized MEDLINE citations into different topical groups and prioritized important citations in each group.
A text mining system framework for automatic document clustering and ranking organized MEDLINE citations following simple PubMed queries. The system grouped the retrieved citations, ranked the citations in each cluster, and generated a set of keywords and MeSH terms to describe the common theme of each cluster.
Several possible ranking functions were compared, including citation count per year (CCPY), citation count (CC), and journal impact factor (JIF). We evaluated this framework by identifying as “important” those articles selected by the Surgical Oncology Society.
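To make the difference between CC and CCPY concrete, here is a minimal sketch. The text does not spell out how CCPY is normalized, so dividing by the article's age in years (with a floor of one year) is an assumption, and the `Citation` record and sample values are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Citation:
    pmid: str
    year: int       # publication year
    cited_by: int   # total citation count (CC)

def ccpy(c: Citation, current_year: int) -> float:
    """Citation count per year (assumed: CC divided by age, age >= 1)."""
    age = max(current_year - c.year, 1)
    return c.cited_by / age

def rank_by_ccpy(citations, current_year):
    """Order citations from highest to lowest CCPY."""
    return sorted(citations, key=lambda c: ccpy(c, current_year), reverse=True)

def rank_by_cc(citations):
    """Order citations from highest to lowest raw citation count."""
    return sorted(citations, key=lambda c: c.cited_by, reverse=True)

# Hypothetical cluster of three citations.
cluster = [
    Citation("A", 2000, 200),
    Citation("B", 2008, 60),
    Citation("C", 2005, 50),
]
ccpy_order = [c.pmid for c in rank_by_ccpy(cluster, current_year=2010)]
cc_order = [c.pmid for c in rank_by_cc(cluster)]
```

In this toy cluster the recent article "B" moves ahead of the older, more heavily cited "A" under CCPY, which is the kind of age correction that raw CC cannot provide.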
Our results showed that CCPY outperforms CC and JIF, i.e., CCPY better ranked important articles than did the others. Furthermore, our text clustering and knowledge extraction strategy grouped the retrieval results into informative clusters as revealed by the keywords and MeSH terms extracted from the documents in each cluster.
The text mining system studied effectively integrated text clustering, text summarization, and text ranking and organized MEDLINE retrieval results into different topical groups.
The growing numbers of topically relevant biomedical publications readily available due to advances in document retrieval methods pose a challenge to clinicians practicing evidence-based medicine. It is increasingly time consuming to acquire and critically appraise the available evidence. This problem could be addressed in part if methods were available to automatically recognize rigorous studies immediately applicable in a specific clinical situation. We approach the problem of recognizing studies containing useable clinical advice from retrieved topically relevant articles as a binary classification problem. The gold standard used in the development of PubMed clinical query filters forms the basis of our approach. We identify scientifically rigorous studies using supervised machine learning techniques (Naïve Bayes, support vector machine (SVM), and boosting) trained on high-level semantic features. We combine these methods using an ensemble learning method (stacking). The performance of learning methods is evaluated using precision, recall and F1 score, in addition to area under the receiver operating characteristic (ROC) curve (AUC). Using a training set of 10,000 manually annotated MEDLINE citations, and a test set of an additional 2,000 citations, we achieve 73.7% precision and 61.5% recall in identifying rigorous, clinically relevant studies, with stacking over five feature-classifier combinations and 82.5% precision and 84.3% recall in recognizing rigorous studies with treatment focus using stacking over word + metadata feature vector. Our results demonstrate that a high quality gold standard and advanced classification methods can help clinicians acquire best evidence from the medical literature.
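The evaluation measures named above can be sketched as follows. This only illustrates precision, recall, and F1 for a binary classifier; the label lists are hypothetical, and the F1 value derived from the reported precision and recall is plain arithmetic rather than a figure stated in the text.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def precision_recall_f1(predicted, actual):
    """Compute precision, recall and F1 from parallel lists of 0/1 labels."""
    tp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 1)
    fp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 0)
    fn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_score(precision, recall)

# Hypothetical predictions for five citations (1 = rigorous study).
predicted = [1, 1, 0, 1, 0]
actual = [1, 0, 0, 1, 1]
p, r, f1 = precision_recall_f1(predicted, actual)

# F1 implied by the reported 73.7% precision and 61.5% recall.
reported_f1 = f1_score(0.737, 0.615)
```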
The MEDLINE database contains over 12 million references to scientific literature, with about 3/4 of recent articles including an abstract of the publication. Retrieval of entries using queries with keywords is useful for human users that need to obtain small selections. However, particular analyses of the literature or database developments may need the complete ranking of all the references in the MEDLINE database as to their relevance to a topic of interest. This report describes a method that does this ranking using the differences in word content between MEDLINE entries related to a topic and the whole of MEDLINE, in a computational time appropriate for an article search query engine.
We tested the capabilities of our system to retrieve MEDLINE references which are relevant to the subject of stem cells. We took advantage of the existing annotation of references with terms from the MeSH hierarchical vocabulary (Medical Subject Headings, developed at the National Library of Medicine). A training set of 81,416 references was constructed by selecting entries annotated with the MeSH term stem cells or some child in its sub tree. Frequencies of all nouns, verbs, and adjectives in the training set were computed and the ratios of word frequencies in the training set to those in the entire MEDLINE were used to score references. Self-consistency of the algorithm, benchmarked with a test set containing the training set and an equal number of references randomly selected from MEDLINE was better using nouns (79%) than adjectives (73%) or verbs (70%). The evaluation of the system with 6,923 references not used for training, containing 204 articles relevant to stem cells according to a human expert, indicated a recall of 65% for a precision of 65%.
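The word-frequency-ratio idea can be sketched as below. The text does not say how per-word ratios are combined into a reference score, so the add-one smoothing, the log transform, and the averaging over known words are all assumptions; the tiny token lists are hypothetical stand-ins for part-of-speech-filtered MEDLINE entries.

```python
import math
from collections import Counter

def log_ratio_scores(topic_docs, all_docs):
    """Log ratio of smoothed word frequency in the topic set vs. the whole corpus."""
    topic = Counter(w for doc in topic_docs for w in doc)
    background = Counter(w for doc in all_docs for w in doc)
    vocab_size = len(background)
    topic_total = sum(topic.values()) + vocab_size        # add-one smoothing
    background_total = sum(background.values()) + vocab_size
    return {
        w: math.log(((topic[w] + 1) / topic_total)
                    / ((background[w] + 1) / background_total))
        for w in background
    }

def score_reference(words, scores):
    """Average log ratio over the words that appear in the scoring table."""
    known = [scores[w] for w in words if w in scores]
    return sum(known) / len(known) if known else 0.0

# Hypothetical training data: topic entries are a subset of the full corpus.
topic_docs = [["stem", "cell", "differentiation"], ["stem", "cell", "niche"]]
other_docs = [["protein", "binding", "cell"], ["patient", "cohort", "trial"]]
scores = log_ratio_scores(topic_docs, topic_docs + other_docs)

on_topic = score_reference(["stem", "cell", "culture"], scores)
off_topic = score_reference(["patient", "trial", "protein"], scores)
```

Sorting all references by such a score yields a complete ranking rather than a filtered subset, which is the property the method relies on.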
This strategy appears to be useful for predicting the relevance of MEDLINE references to a given concept. The method is simple and can be used with any user-defined training set. Choice of the part of speech of the words used for classification has important effects on performance. Lists of words, scripts, and additional information are available from the web address .
The process of constructing a systematic review, a document that compiles the published evidence pertaining to a specified medical topic, is intensely time-consuming, often taking a team of researchers over a year, with the identification of relevant published research comprising a substantial portion of the effort. The standard paradigm for this information-seeking task is to use Boolean search; however, this requires the user(s) to examine every returned result. Further, our experience is that effective Boolean queries for this specific task are extremely difficult to formulate and typically require multiple iterations of refinement before being finalized.
We explore the effectiveness of using ranked retrieval as compared to Boolean querying for the purpose of constructing a systematic review. We conduct a series of experiments involving ranked retrieval, using queries defined methodologically, in an effort to understand the practicalities of incorporating ranked retrieval into the systematic search task.
Our results show that ranked retrieval by itself is not viable for this search task requiring high recall. However, we describe a refinement of the standard Boolean search process and show that ranking within a Boolean result set can improve the overall search performance by providing early indication of the quality of the results, thereby speeding up the iterative query-refinement process.
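Ranking within a Boolean result set can be sketched as a two-stage process: filter with the Boolean query, then order the matches. The query representation (an AND of OR-groups) and the scoring rule (a count of query-term occurrences) are assumptions chosen for brevity, not the ranking function used in the study.

```python
def boolean_match(tokens, query):
    """True if every OR-group in the query has at least one term in the document."""
    present = set(tokens)
    return all(group & present for group in query)

def rank_within_boolean(docs, query):
    """Keep only Boolean matches, then rank them by query-term occurrences."""
    query_terms = set().union(*query)
    matched = {doc_id: toks for doc_id, toks in docs.items()
               if boolean_match(toks, query)}

    def score(doc_id):
        return sum(1 for t in matched[doc_id] if t in query_terms)

    return sorted(matched, key=score, reverse=True)

# Hypothetical documents and query: (statin OR statins) AND stroke.
docs = {
    "d1": "statin therapy reduces stroke risk in randomized trial".split(),
    "d2": "statin statin stroke stroke trial of statin therapy".split(),
    "d3": "aspirin for stroke prevention".split(),
}
query = [{"statin", "statins"}, {"stroke"}]
ranked = rank_within_boolean(docs, query)
```

Because the Boolean filter is untouched, recall is the same as for the plain Boolean search; only the order in which the searcher sees the results changes.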
Outcomes of experiments suggest that an interactive query-development process using a hybrid ranked and Boolean retrieval system has the potential for significant time-savings over the current search process in systematic reviewing.
OBJECTIVE: Assess the performance of the SAPHIRE automated information retrieval system. DESIGN: Comparative study of automated and human searching of a MEDLINE test collection. MEASUREMENTS: Recall and precision of SAPHIRE were compared with those attributes of novice physicians, expert physicians, and librarians for a test collection of 75 queries and 2,334 citations. Failure analysis assessed the efficacy of the Metathesaurus as a concept vocabulary; the reasons for retrieval of nonrelevant articles and nonretrieval of relevant articles; and the effect of changing the weighting formula for relevance ranking of retrieved articles. RESULTS: Recall and precision of SAPHIRE were comparable to those of both physician groups, but less than those of librarians. CONCLUSION: The current version of the Metathesaurus, as utilized by SAPHIRE, was unable to represent the conceptual content of one-fourth of physician-generated MEDLINE queries. The most likely cause for retrieval of nonrelevant articles was the presence of some or all of the search terms in the article, with frequencies high enough to lead to retrieval. The most likely cause for nonretrieval of relevant articles was the absence of the actual terms from the query, with synonyms or hierarchically related terms present instead. There were significant variations in performance when SAPHIRE's concept-weighting formulas were modified.
Secondary use of electronic health record (EHR) data relies on the ability to retrieve accurate and complete information about desired patient populations. The Text Retrieval Conference (TREC) 2011 Medical Records Track was a challenge evaluation allowing comparison of systems and algorithms to retrieve patients eligible for clinical studies from a corpus of de-identified medical records, grouped by patient visit. Participants retrieved cohorts of patients relevant to 35 different clinical topics, and visits were judged for relevance to each topic. This study identified the most common barriers to identifying specific clinic populations in the test collection.
Using the runs from track participants and judged visits, we analyzed the five non-relevant visits most often retrieved and the five relevant visits most often overlooked. Categories were developed iteratively to group the reasons for incorrect retrieval for each of the 35 topics.
Reasons fell into nine categories for non-relevant visits and five categories for relevant visits. Non-relevant visits were most often retrieved because they contained a non-relevant reference to the topic terms. Relevant visits were most often missed because they used a synonym for a topic term.
This failure analysis provides insight into areas for future improvement in EHR-based retrieval with techniques such as more widespread and complete use of standardized terminology in retrieval and data entry systems.
Physicians ask many complex questions during the patient encounter. Information retrieval systems that can provide immediate and relevant answers to these questions can be invaluable aids to the practice of evidence-based medicine. In this study, we first automatically identify topic keywords from ad hoc clinical questions with a Conditional Random Field model that is trained over thousands of manually annotated clinical questions. We then report on a linear model that assigns weights to query terms based on their automatically identified semantic roles: topic keywords, domain-specific terms, and their synonyms. Our evaluation shows that this weighted keyword model improves information retrieval from the Text Retrieval Conference Genomics track data.
The KB-Rank tool was developed to help determine the functions of proteins. A user provides a text query, and protein structures are retrieved together with their functional annotation categories. Structures and annotation categories are ranked according to their estimated relevance to the queried text. The ranking algorithm first retrieves matches between the query text and the text fields associated with the structures. The structures are next ordered by their relative content of annotations that are found to be prevalent across all the structures retrieved. An interactive web interface was implemented to navigate and interpret the relevance of the structures and annotation categories retrieved by a given search. The aim of the KB-Rank tool is to provide a means to quickly identify protein structures of interest and the annotations most relevant to the queries posed by a user. Informational and navigational searches regarding disease topics are described to illustrate the tool’s utility. The tool is available at the URL http://protein.tcmedc.org/KB-Rank.
Protein structural chain; Text query; Relevance ranking; Function; Disease
Because of the increasing number of electronic resources, designing efficient tools to retrieve and exploit them is a major challenge. Some improvements have been offered by semantic Web technologies and applications based on domain ontologies. In life science, for instance, the Gene Ontology is widely exploited in genomic applications, and the Medical Subject Headings vocabulary is the basis of the indexing of biomedical publications and of the information retrieval process offered by PubMed. However, current search engines suffer from two main drawbacks: there is limited user interaction with the list of retrieved resources, and no explanation of their adequacy to the query is provided. Users may thus be confused by the selection and have no idea how to adapt their queries so that the results match their expectations.
This paper describes an information retrieval system that relies on a domain ontology to widen the set of relevant documents retrieved and that uses a graphical rendering of query results to favor user interactions. Semantic proximities between ontology concepts and aggregating models are used to assess document adequacy with respect to a query. The selection of documents is displayed in a semantic map to provide graphical indications that make explicit to what extent they match the user's query; this man/machine interface favors a more interactive and iterative exploration of the data corpus by facilitating query concept weighting and visual explanation. We illustrate the benefit of using this information retrieval system on two case studies, one of which aims at collecting human genes related to transcription factors involved in the hemopoiesis pathway.
The ontology-based information retrieval system described in this paper (OBIRS) is freely available at: http://www.ontotoolkit.mines-ales.fr/ObirsClient/. This environment is a first step towards a user-centred application in which the system highlights relevant information to support decision making.
Keyword searching through PubMed and other systems is the standard means of retrieving information from Medline. However, ad-hoc retrieval systems do not meet all of the needs of databases that curate information from literature, or of text miners developing a corpus on a topic that has many terms indicative of relevance. Several databases have developed supervised learning methods that operate on a filtered subset of Medline, to classify Medline records so that fewer articles have to be manually reviewed for relevance. A few studies have considered generalisation of Medline classification to operate on the entire Medline database in a non-domain-specific manner, but existing applications lack speed, available implementations, or a means to measure performance in new domains.
MScanner is an implementation of a Bayesian classifier that provides a simple web interface for submitting a corpus of relevant training examples in the form of PubMed IDs and returning results ranked by decreasing probability of relevance. For maximum speed it uses the Medical Subject Headings (MeSH) and journal of publication as a concise document representation, and takes roughly 90 seconds to return results against the 16 million records in Medline. The web interface provides interactive exploration of the results, and cross validated performance evaluation on the relevant input against a random subset of Medline. We describe the classifier implementation, cross validate it on three domain-specific topics, and compare its performance to that of an expert PubMed query for a complex topic. In cross validation on the three sample topics against 100,000 random articles, the classifier achieved excellent separation of relevant and irrelevant article score distributions, ROC areas between 0.97 and 0.99, and averaged precision between 0.69 and 0.92.
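A Bayesian ranking over MeSH and journal features can be sketched as follows. This is not MScanner's actual implementation: the per-feature log likelihood ratio, the add-one smoothing, and the rule of summing only over features present in a record are assumptions, and the feature sets are hypothetical.

```python
import math
from collections import Counter

def train(relevant, background):
    """Per-feature log likelihood ratio: P(feature | relevant) / P(feature | background)."""
    rel = Counter(f for doc in relevant for f in set(doc))
    bg = Counter(f for doc in background for f in set(doc))
    n_rel, n_bg = len(relevant), len(background)
    features = set(rel) | set(bg)
    return {
        f: math.log(((rel[f] + 1) / (n_rel + 2)) / ((bg[f] + 1) / (n_bg + 2)))
        for f in features
    }

def score(doc, weights):
    """Sum the log ratios of the features present; unseen features contribute 0."""
    return sum(weights.get(f, 0.0) for f in set(doc))

# Hypothetical records, each described by its MeSH terms.
relevant = [{"Stem Cells", "Cell Differentiation"}, {"Stem Cells", "Mice"}]
background = [{"Humans", "Hypertension"}, {"Mice", "Neoplasms"},
              {"Humans", "Stem Cells"}]
weights = train(relevant, background)

candidates = {
    "c1": {"Stem Cells", "Mice"},
    "c2": {"Humans", "Hypertension"},
}
ranked = sorted(candidates, key=lambda c: score(candidates[c], weights),
                reverse=True)
```

Restricting the representation to a few controlled-vocabulary features per record keeps scoring to a handful of dictionary lookups, which is consistent with the emphasis on speed described above.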
MScanner is an effective non-domain-specific classifier that operates on the entire Medline database, and is suited to retrieving topics for which many features may indicate relevance. Its web interface simplifies the task of classifying Medline citations, compared to building a pre-filter and classifier specific to the topic. The data sets and open source code used to obtain the results in this paper are available on-line and as supplementary material, and the web interface may be accessed at .
Existing biological databases support a variety of queries such as keyword or definition search. However, they do not provide any measure of relevance for the instances reported, and result sets are usually sorted arbitrarily.
We describe a system that builds upon the complex infrastructure of the Biozon database and applies methods similar to those of Google to rank documents that match queries. We explore different prominence models and study the spectral properties of the corresponding data graphs. We evaluate the information content of principal and non-principal eigenspaces, and test various scoring functions which combine contributions from multiple eigenspaces. We also test the effect of similarity data and other variations which are unique to the biological knowledge domain on the quality of the results. Query result sets are assessed using a probabilistic approach that measures the significance of coherence between directly connected nodes in the data graph. This model allows us, for the first time, to compare different prominence models quantitatively and effectively and to observe unique trends.
Our tests show that the ranked query results outperform unsorted results with respect to our significance measure and the top ranked entities are typically linked to many other biological entities. Our study resulted in a working ranking system of biological entities that was integrated into Biozon at .
Develop and analyze results from an image retrieval test collection.
After participating research groups obtained and assessed results from their systems in the image retrieval task of Cross-Language Evaluation Forum, we assessed the results for common themes and trends. In addition to overall performance, results were analyzed on the basis of topic categories (those most amenable to visual, textual, or mixed approaches) and run categories (those employing queries entered by automated or manual means as well as those using visual, textual, or mixed indexing and retrieval methods). We also assessed results on the different topics and compared the impact of duplicate relevance judgments.
A total of 13 research groups participated. Analysis was limited to the best run submitted by each group in each run category. The best results were obtained by systems that combined visual and textual methods. There was substantial variation in performance across topics. Systems employing textual methods were more resilient to visually oriented topics than those using visual methods were to textually oriented topics. The primary performance measure of mean average precision (MAP) was not necessarily associated with other measures, including those possibly more pertinent to real users, such as precision at 10 or 30 images.
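The two kinds of measure contrasted above can be sketched as follows. These are the standard definitions of average precision and precision at k; the ranked list and relevance judgments are hypothetical.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def average_precision(ranked, relevant):
    """Mean of precision values at each rank where a relevant item appears,
    divided by the total number of relevant items."""
    hits, total = 0, 0.0
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """MAP over (ranked, relevant) pairs, one pair per topic."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Hypothetical topic: three relevant images, one of which is never retrieved.
ranked = ["a", "b", "c", "d", "e"]
relevant = {"a", "c", "x"}
ap = average_precision(ranked, relevant)
p2 = precision_at_k(ranked, relevant, 2)
```

Average precision penalizes relevant items that are never retrieved, whereas precision at k looks only at the first screen of results, which is one reason the two measures can disagree.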
We developed a test collection amenable to assessing visual and textual methods for image retrieval. Future work must focus on how varying topic and run types affect retrieval performance. Users' studies also are necessary to determine the best measures for evaluating the efficacy of image retrieval systems.
MorphoSaurus, a concept-based document search engine, was incorporated into an EHR system in order to support search across the whole corpus of patient discharge letters and other clinically relevant documents. A user survey showed a general satisfaction with the system and revealed novel usages for information stored in discharge letters. The retrieval system was also used to identify relevant documents for a five-year retrospective survey of suspicious syphilis cases in the department. This retrieval scenario was used to assess the performance of MorphoSaurus against a manually created gold standard. A substring search for the German words “syphilis” and “lues” was used as baseline. The system yielded a precision p = 20.1% and a recall r = 100%. The values for the substring “syphilis” were p = 65.5% and r = 47.5%, for “lues” p = 15.4% and r = 87.7%. The results support the use of the proposed recall-oriented search across EHR documents to acquire valid and complete data for epidemiology studies in hospital populations.
Finding relevant articles from PubMed is challenging because it is hard to express the user's specific intention in the given query interface, and a keyword query typically retrieves a large number of results. Researchers have applied machine learning techniques to find relevant articles by ranking the articles according to the learned relevance function. However, the process of learning and ranking is usually done offline without being integrated with the keyword queries, and the users have to provide a large number of training documents to get a reasonable learning accuracy. This paper proposes a novel multi-level relevance feedback system for PubMed, called RefMed, which supports both ad-hoc keyword queries and multi-level relevance feedback in real time on PubMed.
RefMed supports a multi-level relevance feedback by using the RankSVM as the learning method, and thus it achieves higher accuracy with less feedback. RefMed "tightly" integrates the RankSVM into RDBMS to support both keyword queries and the multi-level relevance feedback in real time; the tight coupling of the RankSVM and DBMS substantially improves the processing time. An efficient parameter selection method for the RankSVM is also proposed, which tunes the RankSVM parameter without performing validation. Thereby, RefMed achieves a high learning accuracy in real time without performing a validation process. RefMed is accessible at http://dm.postech.ac.kr/refmed.
RefMed is the first multi-level relevance feedback system for PubMed, which achieves a high accuracy with less feedback. It effectively learns an accurate relevance function from the user’s feedback and efficiently processes the function to return relevant articles in real time.
In the biomedical domain, there is an immense amount of data and a tremendous increase in genomics and biomedical publications. This wealth of information has led to growing interest in, and need for, information retrieval techniques for accessing the scientific literature in genomics and related biomedical disciplines. In many cases, the information a biologist's query seeks is a list of entities of a certain type covering different aspects related to the question, such as cells, genes, diseases, proteins, and mutations. Hence, it is important for a biomedical IR system to be able to provide relevant and diverse answers to fulfill biologists' information needs. However, traditional IR models are concerned only with the relevance between retrieved documents and the user query, and do not take redundancy between retrieved documents into account. This leads to high redundancy and low diversity in the ranked retrieval lists.
In this paper, we propose an approach that employs a generative topic model, Latent Dirichlet Allocation (LDA), to promote ranking diversity for biomedical information retrieval. Unlike other approaches or models, which consider aspects at the word level, our approach assumes that aspects should be identified by the topics of retrieved documents. We use an LDA model to discover the topic distribution of retrieved passages and the word distribution of each topic dimension, and then re-rank retrieval results by topic distribution similarity between passages within a sliding window of size N. We evaluate our approach on the TREC 2007 Genomics collection with two distinct IR baseline runs, achieving an 8% improvement over the highest Aspect MAP reported in the TREC 2007 Genomics track.
The proposed method is the first study to adopt a topic model for genomics information retrieval, and it demonstrates effectiveness in promoting ranking diversity as well as in improving the relevance of ranked lists in genomics search. Moreover, we propose a distance measure, a modified Euclidean distance, that quantifies how much a passage can increase topical diversity by considering both the topical importance and the topical coefficient given by LDA.
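The general shape of window-based diversity re-ranking over topic distributions can be sketched as below. This is not the authors' algorithm: the modified Euclidean distance is not specified in the text, so plain Euclidean distance (`math.dist`) is swapped in, and the greedy selection rule, the `lam` trade-off weight, and the passage values are all assumptions.

```python
import math

def rerank(passages, window=2, lam=0.5):
    """Greedy re-ranking: relevance plus novelty relative to the last
    `window` selected passages, measured on topic distributions.

    passages: list of (passage_id, relevance, topic_distribution) tuples.
    """
    remaining = list(passages)
    selected = []
    while remaining:
        recent = selected[-window:]

        def gain(p):
            _, relevance, topics = p
            if not recent:
                return relevance
            # Novelty: distance to the closest recently selected passage.
            novelty = min(math.dist(topics, q[2]) for q in recent)
            return relevance + lam * novelty

        best = max(remaining, key=gain)
        remaining.remove(best)
        selected.append(best)
    return [p[0] for p in selected]

# Hypothetical passages with two-topic distributions (as LDA might produce).
passages = [
    ("p1", 0.90, (0.9, 0.1)),
    ("p2", 0.85, (0.9, 0.1)),  # nearly redundant with p1
    ("p3", 0.70, (0.1, 0.9)),  # less relevant, but a different topic
]
order = rerank(passages)
```

Here the topically distinct passage "p3" is promoted above the redundant "p2", which is the effect that aspect-level measures reward.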
Motivation: Genome-wide measurement of transcript levels is a ubiquitous tool in biomedical research. As experimental data continues to be deposited in public databases, it is becoming important to develop search engines that enable the retrieval of relevant studies given a query study. While retrieval systems based on meta-data already exist, data-driven approaches that retrieve studies based on similarities in the expression data itself have a greater potential of uncovering novel biological insights.
Results: We propose an information retrieval method based on differential expression. Our method deals with arbitrary experimental designs and performs competitively with alternative approaches, while making the search results interpretable in terms of differential expression patterns. We show that our model yields meaningful connections between biological conditions from different studies. Finally, we validate a previously unknown connection between malignant pleural mesothelioma and SIM2s suggested by our method, via real-time polymerase chain reaction in an independent set of mesothelioma samples.
Availability: Supplementary data and source code are available from http://www.ebi.ac.uk/fg/research/rex.
Supplementary Information: Supplementary data are available at Bioinformatics online.
This paper presents a novel multiple-keyword annotation method for medical images, keyword-based medical image retrieval, and a relevance feedback method for enhancing image retrieval performance. For semantic keyword annotation, this study proposes a novel medical image classification method combining local wavelet-based center symmetric–local binary patterns with random forests. For keyword-based image retrieval, our retrieval system uses the confidence score that is assigned to each annotated keyword by combining the probabilities of random forests with a predefined body relation graph. To overcome the limitations of keyword-based image retrieval, we combine our image retrieval system with a relevance feedback mechanism based on visual features and a pattern classifier. Compared with other annotation and relevance feedback algorithms, the proposed method shows both improved annotation performance and accurate retrieval results.
Image annotation; Random forests; Confidence score; Body relation graph; Relevance feedback
In the biomedical domain, the desired information of a question (query) asked by biologists usually is a list of a certain type of entities covering different aspects that are related to the question, such as genes, proteins, diseases, mutations, etc. Hence it is important for a biomedical information retrieval system to be able to provide comprehensive and diverse answers to fulfill biologists’ information needs. However, traditional retrieval models assume that the relevance of a document is independent of the relevance of other documents. This assumption may result in high redundancy and low diversity in the retrieval ranked lists.
In this paper, we propose a relevance-novelty combined model, named RelNov model, based on the framework of an undirected graphical model. It consists of two component models, namely the aspect-term relevance model and the aspect-term novelty model. They model the relevance of a document and the novelty of a document respectively. We show that our approach can achieve 16.4% improvement over the highest aspect level MAP reported in the TREC 2007 Genomics track, and 9.8% improvement over the highest passage level MAP reported in the TREC 2007 Genomics track.
The proposed combination model which models aspects, terms, topic relevance and document novelty as potential functions is demonstrated to be effective in promoting ranking diversity as well as in improving relevance of ranked lists for genomics search. We also show that the use of aspect plays an important role in the model. Moreover, the proposed model can integrate various different relevance and novelty measures easily.
Information overload is a significant problem for modern medicine. Searching MEDLINE for common topics often retrieves more relevant documents than users can review. Therefore, we must identify documents that are not only relevant, but also important. Our system ranks articles using citation counts and the PageRank algorithm, incorporating data from the Science Citation Index. However, citation data is usually incomplete. Therefore, we explore the relationship between the quantity of citation information available to the system and the quality of the result ranking. Specifically, we test the ability of citation count and PageRank to identify “important articles” as defined by experts from large result sets with decreasing citation information. We found that PageRank performs better than simple citation counts, but both algorithms are surprisingly robust to information loss. We conclude that even an incomplete citation database is likely to be effective for importance ranking.
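PageRank over a citation graph can be sketched with a short power iteration. The damping factor of 0.85, the fixed iteration count, the uniform redistribution of rank from papers with no outgoing citations, and the four-paper graph are all assumptions for illustration; the text does not give the system's actual parameters.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank.

    links maps each paper to the list of papers it cites (no duplicates).
    Rank held by papers that cite nothing is spread evenly over all papers.
    """
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        dangling = sum(rank[v] for v in nodes if not links.get(v))
        new = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for src, targets in links.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for t in targets:
                    new[t] += share
        rank = new
    return rank

def citation_counts(links):
    """Simple citation count: number of papers citing each paper."""
    counts = {v: 0 for v in links}
    for targets in links.values():
        for t in targets:
            counts[t] = counts.get(t, 0) + 1
    return counts

# Hypothetical citation graph: paper -> papers it cites.
links = {"A": [], "B": ["A"], "C": ["A"], "D": ["A", "B"]}
rank = pagerank(links)
counts = citation_counts(links)
```

Removing entries from `links` before calling either function is one way to simulate an incomplete citation database and compare how the two rankings degrade.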
PubMed is the main point of access to medical literature on the Internet. To enhance the performance of its information retrieval tools, primarily for non-indexed citations, the authors propose a method: expanding users' queries using Unified Medical Language System (UMLS) synonyms, i.e., all the terms gathered under one Concept Unique Identifier.
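Synonym-based query expansion can be sketched as building one OR-group per query term. The synonym table below is a hypothetical stand-in for a UMLS lookup (no UMLS API is called), and the quoting and Boolean layout are assumptions about how the expanded query might be written.

```python
def expand_query(terms, synonyms):
    """Build a Boolean query: each term is ORed with its synonyms,
    and the resulting groups are ANDed together."""
    groups = []
    for term in terms:
        variants = [term] + [s for s in synonyms.get(term, []) if s != term]
        quoted = " OR ".join(f'"{v}"' for v in variants)
        groups.append(f"({quoted})")
    return " AND ".join(groups)

# Hypothetical synonyms grouped under one concept identifier.
synonyms = {"heart attack": ["myocardial infarction", "MI"]}
expanded = expand_query(["heart attack", "aspirin"], synonyms)
```

Because the expansion matches free-text variants rather than assigned index terms, it can reach recent citations that have not yet been indexed.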
This method was evaluated using queries constructed to emphasize the differences between this new method and the current PubMed automatic term mapping. Four experts assessed citation relevance.
Using UMLS, we were able to retrieve new citations in 45.5% of queries, which implies a small increase in recall. The new strategy led to a heterogeneous 23.7% mean increase in non-indexed citations retrieved. Of these, 82% had been published less than 4 months earlier. The overall mean precision was 48.4% but differed according to the evaluators, ranging from 36.7% to 88.1% (inter-rater agreement was poor: kappa = 0.34).
This study highlights the need for specific search tools for each type of user and use-cases. The proposed strategy may be useful to retrieve recent scientific advancement.
This paper proposes a set of web-based indicators for quantifying and ranking the relevance of terms related to key issues in Ecology and Sustainability Science. Search engines that operate in different contexts (e.g. global, social, scientific) are considered as web information carriers (WICs) and are able to analyse: (i) relevance on different levels: global web, individual/personal sphere, on-line news, and culture/science; (ii) time trends of relevance; (iii) relevance of keywords for environmental governance. For the purposes of this study, several indicators and specific indices (relational indices and dynamic indices) were applied to a test set of 24 keywords. Outputs consistently show that traditional study topics in environmental sciences such as water and air have remained the most quantitatively relevant keywords, while interest in systemic issues (i.e. ecosystem and landscape) has grown over the last 20 years. Nowadays, the relevance of new concepts such as resilience and ecosystem services is increasing, but the actual ability of these concepts to influence environmental governance needs to be further studied and understood. The proposed approach, which is based on intuitive and easily replicable procedures, can support the decision-making processes related to environmental governance.
We present a passage relevance model for integrating syntactic and semantic evidence of biomedical concepts and topics using a probabilistic graphical model. Component models of topics, concepts, terms, and document are represented as potential functions within a Markov Random Field. The probability of a passage being relevant to a biologist's information need is represented as the joint distribution across all potential functions. Relevance model feedback of top ranked passages is used to improve distributional estimates of query concepts and topics in context, and a dimensional indexing strategy is used for efficient aggregation of concept and term statistics. By integrating multiple sources of evidence including dependencies between topics, concepts, and terms, we seek to improve genomics literature passage retrieval precision. Using this model, we are able to demonstrate statistically significant improvements in retrieval precision using a large genomics literature corpus.
Plain language search tools for MEDLINE/PubMed are few. We wanted to develop a search tool that would allow anyone to find relevant citations in MEDLINE/PubMed using a free-text, natural language query, without knowing the specialized vocabularies that an expert searcher might use. This tool would translate a question into an efficient search.
The accuracy and relevance of retrieved citations were compared to references cited in BMJ POEMs and CATs (critically appraised topics) questions from the University of Michigan Department of Pediatrics. askMEDLINE's first-pass match rate against the cited references was 75.8% for POEMs and 89.2% for CATs questions. When articles that were deemed to be relevant to the clinical questions were included, the overall efficiency in retrieving journal articles was 96.8% (POEMs) and 96.3% (CATs).
askMEDLINE might be a useful search tool for clinicians, researchers, and other information seekers interested in finding current evidence in MEDLINE/PubMed. The text-only format could be convenient for users with wireless handheld devices and those with low-bandwidth connections in remote locations.
Accuracy of document retrieval from MEDLINE for gene queries is crucially important for many applications in bioinformatics. We explore five information retrieval-based methods to rank documents retrieved by PubMed gene queries for the human genome. The aim is to rank relevant documents higher in the retrieved list. We address the special challenges faced due to ambiguity in gene nomenclature: gene terms that refer to multiple genes, gene terms that are also English words, and gene terms that have other biological meanings.
Our two baseline ranking strategies are quite similar in performance. Two of our three LocusLink-based strategies offer significant improvements. These methods work very well even when there is ambiguity in the gene terms. Our best ranking strategy offers significant improvements on three different kinds of ambiguities over our two baseline strategies (improvements range from 15.9% to 17.7% and 11.7% to 13.3% depending on the baseline). For most genes the best ranking query is one that is built from the LocusLink (now Entrez Gene) summary and product information along with the gene names and aliases. For others, the gene names and aliases suffice. We also present an approach that successfully predicts, for a given gene, which of these two ranking queries is more appropriate.
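The idea of re-ranking retrieved documents against a query built from gene names alone versus names plus summary text can be sketched as follows. This is a minimal illustration that assumes plain term-frequency cosine similarity as the scoring function; the abstract does not state which retrieval model was used, and the helper names (`build_query`, `rerank`), the toy documents, and the toy summary text are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def build_query(names, summary=""):
    """Ranking query from gene names/aliases, optionally plus summary text."""
    return Counter(tokenize(" ".join(names) + " " + summary))


def rerank(docs, query):
    """Order documents by descending similarity to the ranking query."""
    scored = [(cosine(Counter(tokenize(d)), query), d) for d in docs]
    return [d for _, d in sorted(scored, key=lambda x: x[0], reverse=True)]


# Toy example of a gene symbol that is also an English word.
docs = [
    "The cat sat on the mat near another cat",
    "CAT encodes catalase, an enzyme that decomposes hydrogen peroxide",
]
names_only = rerank(docs, build_query(["CAT"]))
names_plus_summary = rerank(
    docs, build_query(["CAT"], "catalase enzyme hydrogen peroxide"))
```

In this toy case the names-only query favors the document about the animal, while adding summary terms moves the gene-related document to the top, which mirrors the kind of ambiguity the ranking strategies are meant to handle.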
We explore the effect of different post-retrieval strategies on the ranking of documents returned by PubMed for human gene queries. We have successfully applied some of these strategies to improve the ranking of relevant documents in the retrieved sets. This holds true even when various kinds of ambiguity are encountered. We feel that it would be very useful to apply strategies like ours on PubMed search results as these are not ordered by relevance in any way. This is especially so for queries that retrieve a large number of documents.
OBJECTIVES: Assess query expansion using thesaurus relationships and definitions in the UMLS Metathesaurus for improving searching performance. METHODS: The queries from a MEDLINE test collection (OHSUMED) were expanded using synonym, hierarchical, and related term information as well as term definitions from the UMLS Metathesaurus. Documents were retrieved from a word-statistical retrieval system and assessed for recall and precision based on relevance judgments from the test collection. RESULTS: All types of query expansion degraded aggregate retrieval performance as measured by recall and precision, although 38.6% of the queries with synonym expansion and up to 29.7% of the queries with hierarchical expansion showed improvement. CONCLUSIONS: Thesaurus-based query expansion causes a decline in retrieval performance generally but improves it in specific instances. Further research must focus on identifying instances where performance improves and how it can be exploited by real users.
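The trade-off reported here can be made concrete with a minimal sketch of synonym-based query expansion evaluated by recall and precision. The thesaurus is a hypothetical in-memory dictionary standing in for Metathesaurus relationships (no real UMLS API is used), retrieval is simple Boolean term matching rather than the word-statistical system in the study, and the documents and relevance judgments are invented for illustration.

```python
def expand_query(terms, thesaurus):
    """Append thesaurus synonyms of each query term, without duplicates."""
    expanded = list(terms)
    for term in terms:
        for synonym in thesaurus.get(term, []):
            if synonym not in expanded:
                expanded.append(synonym)
    return expanded


def retrieve(query_terms, docs):
    """Return ids of documents sharing at least one term with the query."""
    query = set(query_terms)
    return {doc_id for doc_id, text in docs.items()
            if query & set(text.lower().split())}


def precision_recall(retrieved, relevant):
    """Set-based precision and recall against relevance judgments."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Toy collection and judgments (assumptions, not OHSUMED data).
docs = {
    1: "myocardial infarction treatment",
    2: "heart attack outcomes",
    3: "panic attack therapy",
}
relevant = {1, 2}
thesaurus = {"infarction": ["attack"]}

base = precision_recall(retrieve(["infarction"], docs), relevant)
expanded_terms = expand_query(["infarction"], thesaurus)
expanded = precision_recall(retrieve(expanded_terms, docs), relevant)
```

In this toy case the expanded query retrieves the second relevant document but also an irrelevant one, so recall rises while precision falls, one way in which expansion can help an individual query on one measure while hurting it on another.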