Like many other wiki sites, Medpedia has built-in full-text search. Medpedia runs on the MediaWiki content engine and uses the MediaWiki search engine, which relies on MySQL's underlying full-text index. Search in Medpedia operates at the article (document) level and is therefore directly comparable to the MiPACQ document-level baseline system described in the next section.
The Medpedia search data was obtained over HTTP from the live Medpedia server on November 24, 2010, by running the default Medpedia search engine on the unmodified clinical questions and collecting the top 50 results (or fewer, when Medpedia returned fewer than 50 results). These articles were filtered to retain only those present in the original April 26th download, and the results were recorded in a database for later analysis.
The document-level baselines were developed solely to verify that our initial article filter produced reasonable results; the emphasis of this work is on paragraph-level question answering.
MiPACQ Document-Level Baseline
Because of the deficiencies in Medpedia's default search, we decided to create a baseline information retrieval system that operates at the document level. The document-level baseline is based on the Lucene full-text search index, which by default (and in our application) uses the vector space model [27] with normalized tf-idf parameters to rank the documents being queried.
The document-level index was built from Medpedia articles as represented in the MiPACQ database. The entire document is indexed as a single text string. Each indexed document also includes the article title and a unique document ID. There was no specialized processing performed to improve results based on any specific structure or characteristics of Medpedia; hence, the final QA results should transfer without significant differences to other medical resources with similar content.
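To make the ranking model concrete, the sketch below shows cosine-normalized tf-idf scoring over a small in-memory index in Python. It is a simplified illustration of the vector space model rather than Lucene's exact scoring formula (which also includes length and coordination factors), and the tokenizer, helper names, and sample documents are invented for the example.

import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def build_index(docs):
    # Per-document term counts plus document frequencies for idf.
    term_counts = [Counter(tokenize(d)) for d in docs]
    doc_freq = Counter()
    for counts in term_counts:
        doc_freq.update(counts.keys())
    return term_counts, doc_freq

def rank(query, term_counts, doc_freq, n_docs):
    # Score each document by cosine-normalized tf-idf overlap with the query.
    q_terms = tokenize(query)
    scores = []
    for counts in term_counts:
        idf = {t: math.log(n_docs / doc_freq[t]) for t in counts}
        norm = math.sqrt(sum((tf * idf[t]) ** 2 for t, tf in counts.items())) or 1.0
        s = sum(counts[t] * idf[t] ** 2 for t in q_terms if t in counts)
        scores.append(s / norm)
    return sorted(enumerate(scores), key=lambda p: p[1], reverse=True)

docs = ["aspirin relieves headache pain quickly",
        "ibuprofen treats inflammation and joint pain"]
term_counts, doc_freq = build_index(docs)
print(rank("what relieves headache pain", term_counts, doc_freq, len(docs)))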
As is common with information retrieval systems, the MiPACQ document-level baseline makes extensive use of stemming and stop-listing to improve both the precision and recall of the results. The question and document were processed with the Snowball stemmer [28]. Stop-listing was initially limited to Lucene's StandardAnalyzer, but we were later able to improve performance by switching to a domain-specific stop list generated by removing medical terms from the 50 most frequent words in the Medpedia article corpus. The final stop list contained the following words: to, they, but, other, can, for, no, by, been, has, who, was, of, were, are, if, when, on, do, these, be, may, with, is, it, such, how, or, a, at, into, as, you, the, should, in, and, not, that, which, an, then, there, will, their, this.
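As a rough illustration of this analysis step, the following Python sketch applies the same stop list and a Snowball (Porter2) stemmer via NLTK; Lucene's actual tokenization and analyzer chain differ in detail, and the example question is invented.

from nltk.stem.snowball import SnowballStemmer  # assumes NLTK is installed

# The domain-specific stop list described above (general-purpose words only).
STOP_WORDS = {
    "to", "they", "but", "other", "can", "for", "no", "by", "been", "has",
    "who", "was", "of", "were", "are", "if", "when", "on", "do", "these",
    "be", "may", "with", "is", "it", "such", "how", "or", "a", "at", "into",
    "as", "you", "the", "should", "in", "and", "not", "that", "which", "an",
    "then", "there", "will", "their", "this",
}

stemmer = SnowballStemmer("english")

def analyze(text):
    # Lowercase, drop stop-listed words, then stem what remains.
    return [stemmer.stem(tok) for tok in text.lower().split()
            if tok not in STOP_WORDS]

print(analyze("What treatments are recommended for recurring headaches"))
# e.g. ['what', 'treatment', 'recommend', 'recur', 'headach']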
To improve recall, the document-level baseline uses OR as the default conjunction operator. Although OR has a detrimental effect on precision, the fact that the input to the system is full questions means that a substantial fraction of the test questions return no results whatsoever with AND as the default operator, even with stemming and stop-listing.
As with the Medpedia search baseline, the MiPACQ document-level system is configured to return only the top 50 results. Results beyond this point have minimal effect on the evaluation metrics and in practice are rarely reviewed by a user.
MiPACQ Paragraph-Level Baseline
Similar to the document-level baseline system, the MiPACQ paragraph-level baseline system is a traditional information retrieval system based on Lucene. The paragraph-level baseline system uses two indices: the document index described in the previous section and a paragraph index, which uses the same Snowball stemming and stop list.
Question answering in the paragraph-level baseline system requires two steps. First, the top n (currently 20) documents are retrieved using the document-level index. The paragraph index is then queried, but only paragraphs from that document set are considered. The final paragraph scores are the product of the document-level scores and the paragraph scores. In training data experiments, this composite score proved more reliable than simply querying the paragraph index directly because it incorporates context from the surrounding document. To produce the final answer list, the paragraph-level system ranks the candidate answer paragraphs by composite score in descending order.
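The two-step retrieval and score composition can be sketched as follows; doc_index, para_index, and their search and parent_doc methods are hypothetical stand-ins for the Lucene indices rather than the actual MiPACQ interfaces.

def paragraph_level_search(question, doc_index, para_index, n_docs=20, top_k=50):
    # Step 1: top-n documents from the document-level index.
    # Both indices are assumed to expose a search(query, limit) method that
    # returns (id, score) pairs, analogous to a Lucene TopDocs result.
    doc_scores = dict(doc_index.search(question, limit=n_docs))

    # Step 2: query the paragraph index, keeping only paragraphs that belong
    # to one of the retrieved documents, and combine the two scores.
    candidates = []
    for para_id, para_score in para_index.search(question, limit=None):
        doc_id = para_index.parent_doc(para_id)  # assumed paragraph-to-document lookup
        if doc_id in doc_scores:
            composite = doc_scores[doc_id] * para_score  # document score x paragraph score
            candidates.append((para_id, composite))

    # Final answer list: candidate paragraphs in descending composite-score order.
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates[:top_k]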
As with the document-level baseline system, the paragraph-level baseline makes no use of the annotation pipeline. Answer ranking is dependent entirely on the overlap between the set of question tokens and the set of tokens in the candidate answer paragraph and document.
MiPACQ Rule-Based Re-ranking
The rule-based re-ranking system first uses the paragraph-level baseline system to retrieve the top n (currently 100) candidate answer paragraphs. The question and the top answer paragraphs are then annotated using the MiPACQ annotation pipeline. Candidate answer paragraphs are then re-ranked (re-ordered) using a fixed formula based, in part, on the semantic annotations produced by the annotation pipeline. This method is used as a baseline to demonstrate performance based on a few simple informative features.
There are three components to the rule-based scoring function, as described in Equation 1. The first component, S, is the original score from the paragraph-level baseline system (which is itself the product of document- and paragraph-level scores from Lucene). This score is then multiplied by the sum of two other components: a bag-of-words component and a UMLS entity component.
Equation 1: Rule-Based Scoring Function
The UMLS entity component compares the UMLS entity types in the question and answer paragraph. The number of matching UMLS features (AQ ∩ AA) is divided by the number of UMLS entity annotations found in the question (AQ). Add-one smoothing is used to prevent division by zero and scores of zero. For questions with no UMLS entities (and by extension no possible UMLS entity intersections), the smoothing results in a UMLS entity component of 1 for every candidate answer paragraph.
The bag-of-words component is structured similarly to the UMLS component, but it considers matches between individual word tokens (WQ ∩ WA) rather than UMLS entities. Although the Lucene scoring algorithm used by the baseline system already examines word overlap between the question and answer, our experience with the training set showed that adding the bag-of-words component (with uniform rather than tf-idf term weighting, no stop words, and no stemming) slightly improved overall QA performance.
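Putting these pieces together, with WQ and WA denoting the question and candidate-answer token sets and AQ and AA the corresponding sets of UMLS entity annotations, the composite score described above corresponds (up to the exact placement of the add-one smoothing terms in Equation 1) to:

\[
\mathrm{score}(Q, A) = S \times \left( \frac{|W_Q \cap W_A| + 1}{|W_Q| + 1} + \frac{|A_Q \cap A_A| + 1}{|A_Q| + 1} \right)
\]

where S is the paragraph-level baseline score.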
MiPACQ ML-Based Re-ranking
Increasingly, information retrieval and question answering systems are turning to machine-learning-based ranking systems (sometimes called "learning-to-rank"). The most common approaches are point-wise methods (which use supervised learning to predict the absolute scores of individual results), pair-wise methods (which attempt to classify pairs of documents as correctly or incorrectly ordered), and list-wise methods (which consider the entire ranked list as a whole) [29]. However, all of these methods are excessive for question answering tasks because QA systems prioritize ranking at least one answer highly rather than ranking all of the documents in the correct order. As a result, our ML-based re-ranking system uses a method similar to that described by Moschitti and Quarteroni [30], in which question/answer pairs are classified as "valid" or "invalid".
As with the rule-based QA system, the ML-based system first uses the paragraph-level baseline system to retrieve the top 100 paragraphs for the question; the question and each candidate answer paragraph are then run through the annotation pipeline. The annotated questions and answers are then used to generate feature vectors for training and classification.
For each UMLS entity in the question and each UMLS entity in the candidate answer paragraph, the system generates a feature in the vector equal to the frequency of that entity (typically 0 or 1, although some questions and answers contain more than one UMLS entity of the same type). The system also generates features for the token frequencies in the question and answer, for the original baseline score and for the expected answer type.
For each question/paragraph pair in the training dataset (for the top 100 paragraphs, as noted above, for each question), a training instance is generated using the computed feature vector and a binary value indicating whether the question/paragraph pair is valid (the gold standard annotation indicates that the paragraph answers the question). The SVM-Light training system is then used to create a corresponding binary classifier.
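A rough sketch of this feature extraction step is shown below; the .umls_entities and .tokens attributes are illustrative assumptions rather than the actual MiPACQ data structures, and the SVM-Light serialization is omitted.

from collections import Counter

def make_feature_vector(question, answer, baseline_score, expected_type):
    # `question` and `answer` are assumed to expose .umls_entities (a list of
    # UMLS entity-type strings) and .tokens (a list of word strings).
    features = {}
    for side, annotated in (("q", question), ("a", answer)):
        # UMLS entity-type frequencies (typically 0 or 1 per type).
        for etype, count in Counter(annotated.umls_entities).items():
            features[f"umls_{side}_{etype}"] = count
        # Token frequencies.
        for tok, count in Counter(annotated.tokens).items():
            features[f"tok_{side}_{tok}"] = count
    features["baseline_score"] = baseline_score
    features[f"expected_type_{expected_type}"] = 1
    return features

def make_training_instance(question, answer, baseline_score, expected_type, is_answer):
    # Label is 1 when the gold standard says this paragraph answers the question.
    return (make_feature_vector(question, answer, baseline_score, expected_type),
            1 if is_answer else 0)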
During the QA process, the re-ranking system multiplies the baseline scores by the answer probability from the answer classifier. This probability is obtained by fitting a sigmoid function to the classifier output (over the training dataset) using an improved variant of Platt calibration [31] as described by Lin, Lin, and Weng [32]. In experiments with the training dataset, we found that this produced better results than ranking all paragraphs classified as "answer" above those classified as "not answer".
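A basic form of this calibration is sketched below in Python: standard Platt scaling fit by maximum likelihood over the training decision values (the system itself uses the more numerically stable Lin, Lin, and Weng variant). The decision values, labels, and scores in the usage example are made up for illustration.

import numpy as np
from scipy.optimize import minimize

def fit_platt(decision_values, labels):
    # Fit P(answer | f) = 1 / (1 + exp(a*f + b)) by minimizing the negative
    # log-likelihood over the training decision values.
    f = np.asarray(decision_values, dtype=float)
    y = np.asarray(labels, dtype=float)  # 1 = answers the question, 0 = does not

    def nll(params):
        a, b = params
        z = a * f + b
        log1pexp = np.logaddexp(0.0, z)  # log(1 + exp(z)), computed stably
        return np.sum(y * log1pexp + (1.0 - y) * (log1pexp - z))

    a, b = minimize(nll, x0=[-1.0, 0.0], method="Nelder-Mead").x
    return lambda score: 1.0 / (1.0 + np.exp(a * score + b))

# Hypothetical usage: calibrate on training output, then re-rank by
# baseline score x calibrated answer probability.
to_prob = fit_platt([2.1, 0.3, -1.5, -0.2], [1, 1, 0, 0])
reranked = sorted([(0.8, 2.1), (0.5, -0.2)],
                  key=lambda p: p[0] * to_prob(p[1]), reverse=True)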