The following section describes the data used and the components of our system.
Cystic fibrosis (CF) is a multisystem disease in which the respiratory tract is the most severely
affected, and respiratory failure is the common cause of death if lung
transplantation is not available. The gastrointestinal tract, including the pancreas
and liver, is commonly involved, with resultant malabsorption. Ongoing pancreatic
disease may also result in diabetes. Liver involvement may result in the development
of liver damage with cirrhosis and portal hypertension. Abnormal function of the
CFTR results in loss of sodium and chloride in the sweat and may lead to serum
electrolyte abnormalities. In males with CF, congenital absence of the vas deferens
occurs in 95% of patients, rendering them infertile.
The important management strategies in the treatment of CF are:7
- management by a multidisciplinary team;
- treating respiratory tract infections;
- maintaining a high caloric, high fat diet;
- airway clearance i.e. ensuring the airways are free of mucous plugs and
The multisystem nature of the disease and the multidisciplinary treatment strategies
mean that any search engine must be able to cover many aspects of medicine, from
basic genetics to therapeutics, physiotherapy and dietetics. It is also important
that the results from literature searches be clinically relevant, current and in a format
not requiring further searches through hundreds of pages. This was the challenge of
this project. Blood electrolytes and glucose levels are often abnormal in patients
with CF. We thought that searching on these biochemical indices would be an
appropriate starting point to test the search engine logic.
The corpus of publications indexed is the subset of all CF-related publications
accessible through PubMed. As the system has been designed to be used
by a multidisciplinary team of medical researchers sharing care for CF
patients, articles describing studies are of high relevance. For another
audience in different applications it might be more suitable to limit the corpus
to review papers or even clinical practice guidelines.
Our tool is based on the probabilistic search engine PADRE8
which supports the use of weights for search terms. PADRE
considers a query term with a term weight of 2 to be twice as important as a
query term with the default weight of 1. Query terms and synonyms are therefore
weighted in order to distinguish between more or less relevant search terms.
Although some probabilistic search engines allow the manual weighting of query
terms, they are only exploited for internal use such as relevance feedback since
using them for manually entered queries would make the user interface too
complex. We will show example query components in PADRE syntax with alternative
terms being encapsulated in square brackets, e.g. ["search term"^weight
"alternative term"^alt weight]
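The notation above can be illustrated with a minimal sketch that renders weighted terms into query components. The function names `format_term` and `format_component` are hypothetical helpers introduced here for illustration; they are not part of PADRE, and only the bracket and `^weight` notation is taken from the text.

```python
from __future__ import annotations


def format_term(term: str, weight: float) -> str:
    """Render one weighted term in the notation used in this paper."""
    # ":g" drops a trailing ".0", so 1.0 is rendered as 1 and 0.5 stays 0.5.
    return f'"{term}"^{weight:g}'


def format_component(weighted_terms: list[tuple[str, float]]) -> str:
    """Render a query component; alternative terms go in square brackets."""
    rendered = [format_term(term, weight) for term, weight in weighted_terms]
    if len(rendered) == 1:
        return rendered[0]
    return "[" + " ".join(rendered) + "]"


print(format_component([("Cystic Fibrosis", 1)]))
print(format_component([("Young Adult", 1), ("Adolescent", 0.5)]))
```

A single term is emitted without brackets, while a group of alternatives is wrapped in one bracketed component so that each alternative keeps its own weight.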
The fundamental text-matching component of the PADRE ranking function is a
slightly modified version of BM25.9
gives the precise formula. The BM25
score for a document, given a multi-term query, is the weighted sum of the
scores due to each query term. Individual term scores are calculated using a
variant in which term frequencies are
document-length normalised (parameterised) and subjected to a (parameterised)
saturation function. Individual term scores are weighted (multiplied) by qt, the
weight calculated by our tool for term t.
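To make the role of the term weight concrete, the following sketch computes a weighted sum of per-term scores using the textbook Okapi BM25 formula. This is an assumption for illustration only: PADRE uses a slightly modified variant that is not reproduced here, and the parameter defaults `k1=1.2` and `b=0.75`, the function names and the toy numbers are not taken from the system described.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Textbook BM25 score of a single term in a single document."""
    # Inverse document frequency; the "+ 1" form keeps the value non-negative.
    idf = math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1.0)
    # Document-length normalisation, controlled by the parameter b.
    norm = 1.0 - b + b * doc_len / avg_doc_len
    # Saturation of the term frequency, controlled by the parameter k1.
    return idf * tf * (k1 + 1.0) / (tf + k1 * norm)


def bm25_document_score(query_weights, term_freqs, doc_len, avg_doc_len,
                        n_docs, doc_freqs):
    """Weighted sum of per-term scores; query_weights maps term -> q_t."""
    score = 0.0
    for term, q_t in query_weights.items():
        tf = term_freqs.get(term, 0)
        if tf == 0:
            continue  # a term absent from the document contributes nothing
        score += q_t * bm25_term_score(tf, doc_len, avg_doc_len,
                                       n_docs, doc_freqs[term])
    return score


# Toy example: one document, two query terms with different weights.
weights = {"cystic fibrosis": 1.0, "hyperglycemia": 2.0}
term_freqs = {"cystic fibrosis": 3, "hyperglycemia": 1}
doc_freqs = {"cystic fibrosis": 400, "hyperglycemia": 50}
print(bm25_document_score(weights, term_freqs, 250, 300.0, 1000, doc_freqs))
```

Because the weight multiplies the whole term score, doubling q_t doubles that term's contribution regardless of how the term frequency saturates.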
Some studies show that certain domain experts are reluctant to move away from
Boolean retrieval systems.10
Rather than just aiming for 100% recall, this tool has functions that simplify
the retrieval process, which allow it to be embedded in a doctor’s daily
Automatic query generation
The first step in reducing the load on the doctor is to remove the task of
manually entering search terms describing the patient’s situation.
This covers the first aspect of an Evidence-based medicine (EBM)11
related search strategy described as the
acronym PICO, where the doctor expresses clinical questions in four steps:
- Patient: Who/what is the patient/problem being
- Intervention: What is the intended intervention?
- Comparison: What is the intervention compared to?
- Outcome: What are the outcomes?
The query is automatically generated based on the electronic patient record. The
automatic query generation is currently based on diagnoses, the age of the
patient and the results of recent blood tests.
Simple search terms: Diagnoses are already added manually to the
patient record by one of the doctors in previous consultations. We extract those
underlying diagnoses from the patient record and assign them a weight of 1, as
would be the case for a manually entered search term.
Example query component: "Cystic Fibrosis"^1
Fuzzy mapped search terms: PubMed uses Medical Subject Headings
(MeSH) terms to encode the age of the patients described within a publication.
Those age-related MeSH terms are defined crisply and a PubMed search for a
certain age group would not necessarily return publications mentioning patients
that are only one year older. We compensate for this disadvantage of indexing
using crisp MeSH terms. The age of the patient is mapped to age-specific search
terms and weights are calculated based on fuzzy membership functions describing
each of those terms. Table 1 lists the age-related MeSH terms and their crisp
definition, as well as the fuzzy definition we expanded it to by adding the
sides of the trapezoids. Notice that we only generated fuzzy membership
functions for the age ranges of the patients currently in the database. Most
importantly we extended the age range for the term Adult, which
is defined within MeSH as 19 to 44 years. We do not believe that this is the
common understanding of the term Adult. To reflect this, we de
facto removed the upper boundaries for Adult and
Aged by setting them to the artificially high age of 200
years to keep the algorithm simpler whilst still effectively removing one
shows two overlapping trapezoidal
membership functions for Adolescent and Young
Adult and how search term weights are generated for a given age of
20 years. Using a crisp definition, only publications labeled with the MeSH term Young
Adult would be returned. Our system, however, also retrieves
publications indexed with the MeSH term Adolescent, but ranks
those search results lower, due to the reduced term weight.
Calculating term weights for age-related MeSH terms using fuzzy
membership functions for a 20-year-old patient. The fuzzy membership
function for Adolescent is drawn in red (dash-dotted line), the one for
Young Adult in green (dashed line).
Age-related MeSH terms and their crisp definition compared to our
fuzzy definition, for age ranges of patients in the database
The weight for an age-related search term is computed from a trapezoidal fuzzy membership function, using
the values a, b, c, d defined in Table 1.
Weights based on fuzzy membership functions.
Example query component: ["Young Adult"^1 "Adolescent"^0.5]
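The mapping from a patient's age to weighted age terms can be sketched as follows, using the standard trapezoidal membership function with corner points a ≤ b ≤ c ≤ d. The plateau values follow the MeSH definitions of Adolescent (13 to 18 years) and Young Adult (19 to 24 years), but the sloped sides (a and d) used below are hypothetical values chosen only so that the example reproduces the weights quoted above; they are not the values used by our system. The names `AGE_TERMS`, `trapezoid` and `age_query_component` are likewise illustrative.

```python
def trapezoid(x, a, b, c, d):
    """Standard trapezoidal membership: 0 outside (a, d), 1 on [b, c]."""
    if b <= x <= c:
        return 1.0
    if a < x < b:
        return (x - a) / (b - a)  # rising side
    if c < x < d:
        return (d - x) / (d - c)  # falling side
    return 0.0


# Hypothetical corner points (a, b, c, d) in years, for illustration only.
AGE_TERMS = {
    "Adolescent": (11, 13, 18, 22),
    "Young Adult": (17, 19, 24, 28),
}


def age_query_component(age):
    """Build the age-related query component for a patient of a given age."""
    weighted = [(term, trapezoid(age, *corners))
                for term, corners in AGE_TERMS.items()]
    # Keep only terms with non-zero membership, highest weight first.
    weighted = sorted((tw for tw in weighted if tw[1] > 0),
                      key=lambda tw: -tw[1])
    rendered = [f'"{term}"^{weight:g}' for term, weight in weighted]
    if len(rendered) == 1:
        return rendered[0]
    return "[" + " ".join(rendered) + "]"


print(age_query_component(20))
```

With these assumed corner points, a 20-year-old lies on the plateau of Young Adult and halfway down the falling side of Adolescent, which yields the weights 1 and 0.5.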
Extremeness of blood test results: The results from recent blood
tests are compared to the minimum and maximum values provided by the lab for each
measurement. Using only the ranges provided by the labs did not prove sufficient because,
for some blood tests such as glucose, the mean value of the cohort of
patients is already higher than the maximum defined by the lab. This phenomenon
is caused by the fact that all patients in the database are CF patients and CF
patients tend to develop CF-related diabetes.
Our tool also takes into account the cohort of patients in the database. The
variability in the blood tests is shown in a Parallel Coordinates Plot.12
Each line shows the most recent values of blood tests for each
patient in the database. On the four vertical parallel axes labelled calcium,
glucose, potassium and sodium, the data points are scaled to plot the lowest
recorded value on the bottom and the highest recorded value on top of the axes.
This visualisation was used to examine the variation in combinations of
expressions across the patient cohort. The plot has been produced using the
statistical environment R and a package called iPlots. The iPlots package allows
the user to select a certain line or group of lines to see the combination of
expressions. In the screenshot, a group of patients sharing a similarly low
sodium level has been selected, causing the lines to be drawn in red.
A Parallel Coordinates Plot showing the variety of combinations of
blood test results in the cohort of patients.
This type of plot clearly shows that even though the five patients are low in
sodium, the values of other blood test types are widely distributed. This
variability in distribution shows that every patient’s data is likely to
generate a unique query even without taking the other query components into
account.
Let p(x) denote the percentage of observations of the metabolite that
are lower than the observation x for the patient of interest. (Therefore x
is the p(x)th percentile of the empirical distribution of the metabolite
across all patients.) We decided to increase the weight of search terms
according to the following empirically defined exponential weighting.
According to Formula 2, a test result at the 50th percentile (the median) will
receive a weighting of 0.1. This small weight will effectively result in the
search term becoming insignificant. Outlying values, however, will receive higher
weights; for example, a test result at the 5th (or 95th) percentile will receive
a weighting of 8.6 and therefore rank publications containing the search term
higher.
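The percentile-based weighting can be sketched as below. The empirical percentile p(x) follows the definition given in the text. The weighting function, however, is an assumed form, w(p) = w0 · exp(k · |p − 50|), calibrated only so that it reproduces the two values quoted above (0.1 at the median and 8.6 at the 5th and 95th percentiles); it should not be read as the exact formula used by our system. The names `percentile_below`, `extremeness_weight`, `W_MEDIAN`, `W_TAIL` and `K`, and the sample cohort values, are illustrative.

```python
import math


def percentile_below(x, cohort_values):
    """p(x): percentage of cohort observations strictly lower than x."""
    below = sum(1 for value in cohort_values if value < x)
    return 100.0 * below / len(cohort_values)


# Assumed calibration points, taken from the two weights quoted in the text.
W_MEDIAN = 0.1   # weight at the 50th percentile
W_TAIL = 8.6     # weight at the 5th and 95th percentiles
K = math.log(W_TAIL / W_MEDIAN) / 45.0  # growth rate per percentile point


def extremeness_weight(p):
    """Assumed exponential weight, symmetric around the median."""
    return W_MEDIAN * math.exp(K * abs(p - 50.0))


# Hypothetical cohort of glucose readings, for illustration only.
cohort = [4.8, 5.1, 5.6, 6.0, 6.4, 7.2, 7.9, 8.5, 9.3, 12.1]
p = percentile_below(12.1, cohort)
print(p, round(extremeness_weight(p), 1))
```

The symmetry around the median means that unusually low and unusually high results are boosted equally, while results near the median contribute almost nothing to the ranking.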
Introduction of homonyms and conditional homonyms: We manually
compiled a list of homonyms to allow the specialist to fine-tune the system for
If the patient’s test result is outside the range provided by the test lab,
we also add the appropriate term describing this condition. A high or low
glucose level would therefore trigger the alternative search term
Hyperglycemia or Hypoglycemia,
respectively, to be added. The weight for this alternative search term is still
determined by the percentile of the value.
Example query component: ["Glucose"^0.2 "blood sugar"^0.2
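The conditional addition of terms for out-of-range results can be sketched as follows. The function names, the lab reference range and the weight used in the example are hypothetical and serve only to illustrate the rule described above: the extra term is added when the value falls outside the lab range, and it carries the same percentile-derived weight as the other terms.

```python
def glucose_terms(value, lab_min, lab_max, weight):
    """Return weighted search terms for a glucose result.

    The base term and its alternative are always included; a condition
    term is added only if the value is outside the lab reference range.
    """
    terms = [("Glucose", weight), ("blood sugar", weight)]
    if value > lab_max:
        terms.append(("Hyperglycemia", weight))
    elif value < lab_min:
        terms.append(("Hypoglycemia", weight))
    return terms


def render(terms):
    """Render weighted alternatives as one bracketed query component."""
    return "[" + " ".join(f'"{term}"^{weight:g}' for term, weight in terms) + "]"


# Hypothetical reference range and weight, for illustration only.
print(render(glucose_terms(12.0, lab_min=3.0, lab_max=7.7, weight=8.6)))
print(render(glucose_terms(5.5, lab_min=3.0, lab_max=7.7, weight=0.1)))
```

Keeping the range check separate from the weight calculation reflects the two sources of information: the lab range decides whether the condition term appears, and the cohort percentile decides how strongly it counts.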
User interface design and result presentation
The design of the user interface was mainly guided by the following five design
goals: easy to learn, simple to use, fast, comprehensible and visually
Screen shot of the medical literature retrieval tool