|Home | About | Journals | Submit | Contact Us | Français|
The Multi-source Integrated Platform for Answering Clinical Questions (MiPACQ) is a QA pipeline that integrates a variety of information retrieval and natural language processing systems into an extensible question answering system. We present the system’s architecture and an evaluation of MiPACQ on a human-annotated evaluation dataset based on the Medpedia health and medical encyclopedia. Compared with our baseline information retrieval system, the MiPACQ rule-based system demonstrates 84% improvement in Precision at One and the MiPACQ machine-learning-based system demonstrates 134% improvement. Other performance metrics including mean reciprocal rank and area under the precision/recall curves also showed significant improvement, validating the effectiveness of the MiPACQ design and implementation.
The increasing requirement to deliver higher quality care with increased speed and reduced cost has forced clinicians to adopt new approaches to how diagnoses are made and care is administered. One approach, evidence-based medicine, focuses on the integration of the current best clinical expertise and research into the ongoing practice of medicine 1.
One of the positive side effects of this requirement for clinicians to make better use of medical evidence with less time and effort has been practice-enhancing technological solutions. Medical citation databases such as PubMed have enabled immediate access to a wide array of research over the Internet. In many cases, medical information retrieval systems allow clinicians to quickly locate articles that are specific to their information need. However, given that medical information retrieval systems generally return results only at a coarse level, and given that clinicians have a minimal amount of time to search for answers – often as little as two minutes – there is substantial room for improvement 2. The adoption of natural language processing (NLP) techniques allows for the creation of clinical question answering systems that better understand the clinician’s query and can more precisely serve the user’s information need. Such systems typically accept clinical questions in free-text and produce a fine-grained list of potential answers to the question. The result is an improvement in retrieval performance and a reduction in the effort required by the clinician.
The Multi-source Integrated Platform for Answering Clinical Questions (MiPACQ) is an integrated framework for semantic-based question processing and information extraction. Using NLP and information retrieval (IR) techniques, MiPACQ accepts free-text clinical questions and presents the user with succinct answers from a variety of sources 3 such as general medical encyclopedic resources and the patient’s data residing in the Electronic Medical Record (EMR). We present a design and implementation overview of the MiPACQ system along with an evaluation dataset that includes clinical questions and human-annotated answers taken from the Clinques 4 question set and Medpedia 5 online medical encyclopedia. We demonstrate that the MiPACQ system shows significant improvement over the existing Medpedia search engine and a Lucene 6 baseline information retrieval system on standard question answering evaluation metrics such as Precision at One and mean reciprocal rank (MRR).
The use of readily-available Internet search engines such as Google has become commonplace in clinical contexts: Schwartz et. al 7 surveyed emergency medicine residents and found that they considered Internet searches to be highly reliable and routinely used them to answer clinical questions in the emergency department. Unfortunately, the same study found that the residents had difficulty distinguishing between reliable and unreliable answers. Worse, while residents in the study expressed a high degree of confidence in the reliability of answers found through Internet searches, when tested the residents provided incorrect answers 33% of the time. When Athenikos and Han 8 surveyed biomedical QA systems, they asserted that “current medical QA approaches have limitations in terms of the types and formats of questions they can process.”
Some projects have attacked the question answering problem using syntactic information and text-based analysis. Yu et. al. created such a system 9 based on Google definitions of terms from the UMLS 10. The system can provide summarized answers to definitional questions and provides context by citing relevant MEDLINE articles. In subjective evaluations with physicians, Yu’s system did well with a variety of satisfaction metrics; however only definitional questions were tested.
There are a number of clinical question answering systems that make use of semantic information in clinical documents. Demner-Fushman et. al. 11 developed a system of feature extractors that identify subject populations, interventions, and outcomes in medical studies. Demner-Fushman et. al. later extended this research to create a general-purpose QA system 12. This system is similar in principle to the MiPACQ architecture, but smaller in scope as it retrieves answers only at the document level and targets a more restricted answer corpus (MiPACQ returns answers at the paragraph level and targets arbitrary information sources). Wang et. al. 13 experimented with UMLS semantic relationships and created a system that extracts exact answers from the output of a Lucene-based information retrieval system; however, their approach requires an exact semantic match, which limits performance on more complex and non-factoid questions. Athenikos et. al 14 propose a rule-based system based on human-developed question/answer UMLS semantic patterns; such an approach is likely to perform well on specific question types but might have difficulty generalizing to new questions. AskHermes15 finds answers based on keyword searching.
The MiPACQ system is distinguished from existing clinical QA systems in several ways. MiPACQ is broader in scope because it integrates numerous NLP components to enable deeper semantic understanding of medical questions and resources and it is designed to allow integration with a wide range of information sources and NLP systems. Additionally, unlike most medical QA research, MiPACQ was designed from the start to provide an extensible framework for the integration of new system components. Finally, MiPACQ uses machine learning (ML) based re-ranking rather than the fixed rule-based re-ranking used in most other research.
The MiPACQ system integrates multiple NLP tools into a single system that provides query formulation, automatic question and candidate answer annotation, answer re-ranking, and training for the various components. The system is based on ClearTK 16, a UIMA-based 17 toolkit developed by the computational semantics research group 18 at the University of Colorado at Boulder. ClearTK provides common data formats and interfaces for many popular NLP and ML systems.
Figure 1 provides an overview of the MiPACQ architecture (for further details see Nielsen et al., 2010 3). Questions are submitted by a clinician or investigator using the web-based user interface. The question is then processed by the annotation pipeline which adds semantic annotations. MiPACQ then interfaces with the information retrieval (IR) system which uses term overlap to retrieve candidate answer paragraphs. Candidate answer paragraphs are annotated using the annotation pipeline. Finally, the paragraphs are re-ordered by the answer re-ranking system based on the semantic annotations, and the results are presented to the user.
Because the MiPACQ system is designed to be used with a wide variety of medical resources, we wanted to test its performance against a medical article database with a broad focus. Medpedia is a Wiki-style collaborative medical database that targets both clinicians and the general public. Unlike some other Wiki projects (such as Wikipedia), Medpedia restricts editing to approved physicians and doctoral-degreed biomedical scientists 19. Despite this restricted authorship, Medpedia’s volunteer-based collaborative editing model results in a broader range of writing styles and article quality than traditional medical databases or journals. This variability provides a good test of the MiPACQ system’s ability to generalize to noisy and diverse data. Additionally, because Medpedia’s text is freely available under a Creative Commons license, we will be able to distribute our full-text index, annotation guidelines, and annotations to other researchers late in 2011.
Because Medpedia is a continually evolving corpus, we took a snapshot of Medpedia’s full text on April 26th, 2010. This snapshot has served as our Medpedia corpus throughout the MiPACQ project. Articles were broken into paragraphs based on HTML “p” tag boundaries and all other HTML tags (including formatting, images, and hyperlinks were stripped).
The clinical question data is a set of full text questions and Medpedia article and paragraph relevance judgments. We started with 4654 clinical questions from the Clinques corpus collected by Ely et. al. from interviews with physicians 4. During the interviews, which were conducted between patient visits, the physicians were asked what non-patient-specific questions came up during the preceding visit, including “vague fleeting uncertainties”.
Given the variability in quality in the Clinques corpus and our annotation resources, we narrowed the question set to 353 questions that were likely to be answered by the Medpedia medical encyclopedia. The exclusions consisted of questions that were incomprehensible or incomplete, required temporal logic to answer, required qualitative judgments, or required patient-specific knowledge that was not present in Medpedia 5. The remaining questions were then randomly divided into training and evaluation sets. A medical information retrieval and annotation specialist manually formulated and ran queries against the Medpedia corpus and annotated all of the paragraphs in articles that appeared relevant and in a few of the other top ranked articles.
The human annotator found individual paragraphs in Medpedia that answered 177 of the questions. We discarded the remaining questions because Medpedia lacked the answer content. The training set contains 80 questions with at least one paragraph answering the question, and the evaluation set contains 81 questions with at least one answer paragraph. The annotated questions will also be released in late 2011.
The main NLP component of the MiPACQ question answering system is the annotation pipeline. The pipeline incorporates multiple rule-based and ML-based systems that build on each other to produce question and answer annotations that facilitate re-ranking candidate answers based on semantics (e.g., raising the rank of candidates containing the expected answer type or that include similar UMLS entities). Although the pipeline stages are the same for questions and candidate answers (excluding the expected answer type system, which operates only on questions), all of the ML-based pipeline components use separate models for questions and answers.
The first stage in the annotation pipeline processes the question (or candidate answer) with the cTAKES clinical text analysis system. cTAKES is a multi-function system 20 that incorporates sentence detection, tokenization, stemming, part of speech detection, named entity recognition, and dependency parsing. Close collaboration between the MiPACQ project and the cTAKES project resulted in the addition of several features to cTAKES that are directly useful in the MiPACQ system. Because MiPACQ builds upon the unified type system used throughout cTAKES, annotations added by the cTAKES components (such as part-of-speech tags) are readily utilized by later stages in the MiPACQ annotation pipeline.
cTAKES is designed as a pipeline of connected systems that work together to create useful textual annotations 21. The pipeline first locates sentences in the document or question using the OpenNLP 22 sentence detector. Individual sentences are then annotated with token boundaries with a custom rule based tokenizer that handles details such as punctuation and hyphenated words. The next pipeline stage wraps the National Library of Medicine (NLM) SPECIALIST tools 23, providing term canonicalization and lemmatization.
The cTAKES pipeline then performs part-of-speech tagging using a custom UIMA wrapper around the OpenNLP part of speech tagger. The wrapper adapts OpenNLP’s data format to match the cTAKES unified type system and provides functionality for building the tag dictionary and tagging model. cTAKES provides a pre-generated standard tagging model trained on a combination of Penn Treebank 24 and clinical data.
Syntactic parsing is provided in cTAKES using the CLEAR dependency parser 25, a state-of-the-art transition based parser developed by the CLEAR computational semantics group at the University of Colorado at Boulder.
Named entity detection in cTAKES is done using a series of dictionary lookup annotators, which identify drug mentions, anatomical sites, procedures, signs/symptoms, and diseases/disorders using a series of dictionaries based on the UMLS Metathesaurus. Taken together, these annotators provide good coverage for a variety of medical named entity types.
The final stage in the cTAKES pipeline annotates the discovered mentions with negations (e.g. “no chest pain”) and status information (such as “history of myocardial infarction”). These annotations are generated using the finite state machine based NegEx algorithm 26.
The expected answer type system annotates questions with the UMLS entity type or types of the expected answer. We anticipate that paragraphs containing UMLS entities of the expected answer type will be more likely to be correct answers. The expected answer type system is a ML-based classification system built using ClearTK and the SVM-Light training system. Our expected answer type system uses bag-of-words and bigram data to generate feature vectors for each question. We use a series of one-vs.-many binary classifiers to implement multi-way classification. The output of the system is a UMLS entity type category that is expected to match the key UMLS entity contained in the correct answer. For example, in the question “How do you diagnose reactive hypoglycemia?” the expected answer type is “Diagnostic Procedure”.
Like many other Wiki sites, Medpedia has built-in full-text search capabilities. Medpedia is based on the MediaWiki content engine and utilizes the MediaWiki search engine, which is based on MySQL’s underlying full-text search index. Search in Medpedia is done at the article (document) level and is therefore directly comparable to the MiPACQ document-level baseline system described in the next section.
The Medpedia search data was obtained using HTTP from the live Medpedia server on November 24, 2010 by running the default Medpedia search engine on the unmodified clinical questions and collecting the top 50 results (or fewer, when Medpedia did not return at least 50 results). These articles were filtered to include just those that were present in the original April 26th download. The results were recorded in a database for later analysis.
Document-level baselines were developed just as a means of ensuring that our initial article filter provided reasonable results. The emphasis of this work is on performing paragraph-level question answering.
Because of the deficiencies in Medpedia’s default search, we decided to create a baseline information retrieval system that operates at the document level. The document-level baseline is based on the Lucene full-text search index, which by default (and in our application) uses the vector space model 27 with normalized tf-idf parameters to rank the documents being queried.
The document-level index was built from Medpedia articles as represented in the MiPACQ database. The entire document is indexed as a single text string. Each indexed document also includes the article title and a unique document ID. There was no specialized processing performed to improve results based on any specific structure or characteristics of Medpedia; hence, the final QA results should transfer without significant differences to other medical resources with similar content.
As is common with information retrieval systems, the MiPACQ document-level baseline makes extensive use of stemming and stop-listing to improve both the precision and recall of the results. The question and document were processed with the Snowball 28 stemmer. Stop-listing was initially limited to Lucene’s StandardAnalyzer, but we were later able to improve performance by switching to a domain-specific stop list generated by removing medical terms from the 50 most frequent words in the Medpedia article corpus. The final stop list contained the following words: to, they, but, other, can, for, no, by, been, has, who, was, of, were, are, if, when, on, do, these, be, may, with, is, it, such, how, or, a, at, into, as, you, the, should, in, and, not, that, which, an, then, there, will, their, this.
To improve recall, the document-level baseline uses OR as the default conjunction operator. Although OR has a detrimental effect on precision, the fact that the input to the system is full questions means that a substantial fraction of the test questions return no results whatsoever with AND as the default operator, even with stemming and stop-listing.
As with the Medpedia search baseline, the MiPACQ document-level system is configured to return only the top 50 results. Results beyond this point have minimal effect on the evaluation metrics and in practice are rarely reviewed by a user.
Similar to the document-level baseline system, the MiPACQ paragraph-level baseline system is a traditional information retrieval system based on Lucene. The paragraph-level baseline system uses two indices: the document index described in the previous section and a paragraph index, which uses the same Snowball stemming and stop list.
Question answering in the paragraph-level baseline system requires two steps. First, the top n (currently, 20) documents are retrieved using the document-level index. The paragraph index is then queried, but only paragraphs in the previous document set are considered. The final paragraph scores are the product of the document-level scores and the paragraph scores. This composite score was found to be more reliable in training data experiments than simply querying the paragraph index directly because it is more sensitive to context from the document. To produce the final answer list, the paragraph-level system ranks the candidate answer paragraphs by composite score in descending order.
As with the document-level baseline system, the paragraph-level baseline makes no use of the annotation pipeline. Answer ranking is dependent entirely on the overlap between the set of question tokens and the set of tokens in the candidate answer paragraph and document.
The rule-based re-ranking system first uses the paragraph-level baseline system to retrieve the top n (currently 100) candidate answer paragraphs. The question and the top answer paragraphs are then annotated using the MiPACQ annotation pipeline. Candidate answer paragraphs are then re-ranked (re-ordered) using a fixed formula based, in part, on the semantic annotations produced by the annotation pipeline. This method is used as a baseline to demonstrate performance based on a few simple informative features.
There are three components to the rule-based scoring function as described in Equation 1. The first component, S, is the original score from the paragraph-level baseline system (which is itself the product of document- and paragraph-level scores from Lucene). This score is then multiplied by the sum of two other components: a bag-of-words component and a UMLS entity component.
The UMLS entity component compares the UMLS entity types in the question and answer paragraph. The number of matching UMLS features (AQ ∩ AA) is divided by the number of UMLS entity annotations found in the question (AQ). Add-one smoothing is used to prevent division by zero and scores of zero. For questions with no UMLS entities (and by extension no possible UMLS entity intersections), the smoothing results in a UMLS entity component of 1 for every candidate answer paragraph.
The bag-of-words component is structured similarly to the UMLS component, but it considers matches between individual word tokens (WQ ∩ WA) rather than UMLS entities. Although the Lucene scoring algorithm used by the baseline system already examines word overlap between the question and answer, based on our experience with the training set we determined that adding the bag-of-words component (with uniform versus tf-idf term weighting, no stop words, and no stemming) improved overall QA performance slightly.
Increasingly, information retrieval and question answering systems are turning to machine-learning based ranking systems (sometimes called “learning-to-rank”). The most common approaches are point-wise methods (which use supervised learning to predict the absolute scores of individual results), pair-wise methods (which attempt to classify pairs of documents as correctly or incorrectly ordered), and list-wise methods (which consider the entire rank list as a whole) 29. However, all of these methods are excessive for question answering tasks because QA systems prioritize ranking at least one answer highly rather than ranking all of the documents in the correct order. As a result, our ML based re-ranking system uses a method similar to that described by Moschitti and Quarteroni 30 in which question/answer pairs are classified as “valid” or “invalid”.
As with the rule-based QA system, the ML-based system first uses the paragraph baseline system to retrieve the top 100 paragraphs for the question, then the question and each answer paragraph are run through the annotation system. The annotated questions and answers are then used to generate feature vectors for training and classification.
For each UMLS entity in the question and each UMLS entity in the candidate answer paragraph, the system generates a feature in the vector equal to the frequency of that entity (typically 0 or 1, although some questions and answers contain more than one UMLS entity of the same type). The system also generates features for the token frequencies in the question and answer, for the original baseline score and for the expected answer type.
For each question/paragraph pair in the training dataset (for the top 100 paragraphs, as noted above, for each question), a training instance is generated using the computed feature vector and a binary value indicating whether the question/paragraph pair is valid (the gold standard annotation indicates that the paragraph answers the question). The SVM-Light training system is then used to create a corresponding binary classifier.
During the QA process, the re-ranking system multiplies the baseline scores by the answer probability from the answer classifier. This probability is obtained by fitting a sigmoid function to the classifier output (over the training dataset) using an improved variant of Platt calibration 31 as described by Lin, Lin, and Weng 32. In experiments with the training dataset, we found that this produced better results than ranking all paragraphs classified as “answer” above those classified as “not answer”.
The three evaluation metrics we selected were average Precision at One (P@1), mean reciprocal rank (MRR), and area under the precision/recall curve (AUC). Average Precision at One is the fraction of the questions where a correct answer was returned as the first result. The formula for MRR is presented in Equation 2, where Q represents the set of questions being evaluated, and top_rank(Qi) represents the index of the most highly ranked answer for the question being evaluated.
Precision at One and MRR emphasize ranking at least one answer highly, whereas obtaining high AUC depends on the system highly ranking all correct answers. The AUC values were generated by finding the area under a step-function interpolation of the precision/recall curve. In the interpolation, the precision at a given recall is the maximum precision with the same or better recall. All of the presented results are averages across all of the evaluation set questions.
The document-level systems (Medpedia search and the MiPACQ document-level baseline system) were evaluated against the 81 questions in the evaluation set. For the purpose of this evaluation, documents containing at least one paragraph annotated “answer” were considered valid answer documents.
Given that both the MiPACQ and Medpedia systems are based on shallow textual features, the performance difference is dramatic (Precision at One increased by 302%). Despite the fact that Lucene is not intended to be a question answering system, the document-level results are acceptable as a first step in the question answering process. This is in contrast to the Medpedia search system: while it might be acceptable as a search system, as a question answering system it fails to locate relevant documents at an acceptable rate.
The paragraph-level QA system results are presented in Table 2. These results were generated by running the evaluation system against the same question test set as in the document-level evaluation.
Several interesting facts emerge from the data. Although the document-level and paragraph-level results are not directly comparable, the paragraph-level results indicate (as would be expected) that finding the answer within the document is considerably more difficult than simply finding the relevant document. Whereas the Lucene-based MiPACQ document-level baseline system might be considered acceptable as a general-purpose QA system (an “answer” result was ranked first for 49% of the questions), the paragraph-level comparisons show that traditional information retrieval systems are inadequate for the more fine-grained paragraph level question answering task (only 7% of the questions resulted in an answer paragraph being returned as the first result).
Additionally, we can see that even the simple rule-based re-ranking algorithm using UMLS and BOW features provides a substantial performance boost. The rule-based system returned a correct answer in the first position 13.6% of the time (an 84% improvement relative to the baseline system). Because the rule-based system considers only UMLS entity matches and word matches between the question and answer, we can infer that these matches are strong predictors of correct answers. Additionally, despite the fact that the cTAKES system was not trained on the Medpedia corpus (at the time of this writing, we had not yet created gold-standard syntactic or semantic annotations for the Medpedia paragraphs) the generated UMLS entities were sufficiently valid to result in a significant performance improvement, providing additional evidence that the system should work well with new medical resources.
The ML-based re-ranking results showed further improvement over the rule-based system. The ML based system did significantly better at ranking answers in the first position (P@1): a correct answer was ranked highest 17.3% of the time (134% better than the IR baseline system and 27% better than the rule-based system). ML-based re-ranking also showed improvement on MRR compared to both the IR baseline (90% better) and the rule-based system (26% better). AUC showed significant improvement relative to the IR baseline, but was slightly lower than in the rule-based approach.
For 21 questions (26% of the evaluation set) the paragraph-level baseline system returned no answer paragraphs in the top 100. The re-ranking systems (both rule and ML based) cannot improve performance on these questions because paragraphs outside of the top 100 are not currently considered for re-ranking. We expect that adding keyword expansion will help to mitigate this problem by improving the baseline IR system’s recall. The accompanying precision loss should be mitigated by the filtering effect of the ML based re-ranking.
To improve document/paragraph recall, one strategy is term expansion. Each question is expanded into a series of queries with different versions of key terms. A standard information retrieval system is used to query the corpus with each query, and then the ranked results are combined into a single result set. We plan to implement term expansion using the LexEVS 33 terminology server and the UMLS entity features provided by the cTAKES annotator.
Moschitti and Quarteroni demonstrated that semantic role labeling (SRL) of questions and answers with PropBank-style predicate argument structures 34 can provide significant improvements in question answering performance 28. Predicate-argument annotations describe the semantic meanings of predicates and identify their arguments and argument roles. To investigate the effectiveness of adding predicate argument structures, we created a custom SRL system trained on a mixture of the Clinques corpus and PropBank data 34. Evaluation of our SRL system with the CoNLL-2009 shared task development dataset 35 indicates that our system is competitive with other state-of-the-art SRL systems.
Features extracted from UMLS semantic relations may also improve the answer re-ranking and we are completing a system to perform this classification. Full evaluation of the system will be completed once we have integrated the predicate argument structures and UMLS relations into the re-ranking systems.
MiPACQ will be released under an Apache v2 open-source license in late 2011. Associated lexical resources including gold standard annotations will also be made available to the research community.
The increased desire to incorporate medical databases into the clinical practice of medicine has resulted in increased demand for clinical question answering systems. Traditional information retrieval systems have proven inadequate for the task. While baseline systems performed marginally well at retrieving documents pertinent to the question, physicians do not have time to scour the entire document to find the answer and performance was substantially worse when we attempted to apply IR technologies to obtain answers at the paragraph level.
This is the first evaluation of MiPACQ. It demonstrates that the application of NLP and machine learning to clinical question answering can provide substantial performance improvements. By annotating both questions and candidate answers with a variety of syntactic and semantic features and incorporating both existing and new NLP systems into a comprehensive annotation and re-ranking pipeline, we were able to improve question answering performance significantly (an improvement of 133% in Precision at One relative to the baseline IR system). This is particularly significant considering that we have not yet incorporated query re-formulation (key word expansion). When MiPACQ is run against other corpora, the extent to which the re-ranking is boosted by the current machine-learning classifiers remains to be seen. And while the expected answer type classifier is dependent upon the training data, the classifier is only one component of the re-ranking. If needed, they can be retrained easily due to our use of the ClearTK framework. Our approach demonstrates the effectiveness of NLP for clinical question answering tasks as well as the utility of integrating multiple layers of annotation with machine-learning based re-ranking systems.
The project described was supported by award number NLM RC1LM010608. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NLM/NIH.