Most patient care questions raised by clinicians can be answered by online clinical knowledge resources. However, important barriers still challenge the use of these resources at the point of care.
To design and assess a method for extracting clinically useful sentences from synthesized online clinical resources that represent the most clinically useful information for directly answering clinicians’ information needs.
We developed a Kernel-based Bayesian Network classification model based on different domain-specific feature types extracted from sentences in a gold standard composed of 18 UpToDate documents. These features included UMLS concepts and their semantic groups, semantic predications extracted by SemRep, patient population identified by a pattern-based natural language processing (NLP) algorithm, and cue words extracted by a feature selection technique. Algorithm performance was measured in terms of precision, recall, and F-measure.
The feature-rich approach yielded an F-measure of 74% versus 37% for a feature co-occurrence method (p<0.001). Excluding predication, population, semantic concept or text-based features reduced the F-measure to 62%, 66%, 58% and 69% respectively (p<0.01). The classifier applied to Medline sentences reached an F-measure of 73%, which is equivalent to the performance of the classifier on UpToDate sentences (p=0.62).
The feature-rich approach significantly outperformed general baseline methods. This approach significantly outperformed classifiers based on a single type of feature. Different types of semantic features provided a unique contribution to overall classification performance. The classifier’s model and features used for UpToDate generalized well to Medline abstracts.
Online clinical knowledge resources contain answers to most clinical questions raised by clinicians in patient care. Yet, over 60% of these questions go unanswered. Despite significant adoption of online resources and advances in information retrieval technology, important barriers, such as lack of time and efficient access to answers, still challenge clinicians’ use of clinical knowledge resources. Previous efforts to address this problem include methods such as question answering and text summarization.
Previous efforts have focused on processing abstracts or full-text articles from the primary biomedical literature, with promising early results in laboratory settings.[3, 4] Yet, consuming the primary literature is labor-intensive and not compatible with busy clinical workflows. Rather, clinicians prefer online resources, such as UpToDate and Dynamed, that are written by experts who synthesize the latest clinical evidence on a specific topic.[5–9] These resources differ substantially from the primary literature, both in discourse and in structure. While the primary literature focuses on presenting the results of clinical studies, synthesized resources provide clinically actionable recommendations that can be applied to specific patients. Previous studies in clinical decision support (CDS) systems have shown that clinically actionable recommendations are more effective in improving providers’ performance and patient outcomes. For example, a meta-analysis investigating the use of abatacept for rheumatoid arthritis concluded that “there is moderate-level evidence that abatacept is efficacious and safe in the treatment of rheumatoid arthritis.” On the other hand, UpToDate provides a patient-specific recommendation that synthesizes the evidence in the primary literature: “abatacept may be used as an alternative to a TNF inhibitor in patients in whom MTX plus a TNF inhibitor would otherwise be appropriate, particularly in patients unable to use a TNF inhibitor and in patients with a high level of disease activity.”
An important step in question answering and text summarization is the extraction of key sentences, typically based on a sentence ranking algorithm. We propose that, to provide effective CDS, question answering and text summarization tools should focus on extracting clinically actionable recommendations from online information resources, particularly from synthesized resources such as UpToDate. In a previous exploratory study, we tested the feasibility of addressing this problem with a simple semantic-based approach. In the present study, we developed and assessed a feature-rich (i.e., semantic and text-based) classification model that extracts, from synthesized resources, the most useful sentences to support patient-specific treatment decisions. We also conducted an exploratory assessment of the model’s generalizability to sentences from the primary literature.
In question answering and text summarization systems, researchers have used a variety of salient sentence extraction techniques. The first technique was proposed by Edmundson in 1969, in which a score is calculated for each sentence using a statistical function based on cue phrases, keywords, and sentence location. Since then, several variations of the “Edmundsonian paradigm” have been investigated. Lin proposed modeling relevant keywords in specific domains as a mechanism to assess sentence relevance. Sentence position has also been used as a predictor of salient sentences, such as the first paragraph of news articles and the conclusion section for scientific articles. Another common technique is to look for specific cue words, such as “in conclusion” or “in summary.” Machine learning approaches have been proposed using Edmundsonian features as input for the binary classification of sentence relevance.[16, 17] Finally, popular graph-based approaches, such as TextRank, use graph algorithms to compute the similarity between a topic and sentences as well as among sentences themselves.[18–22]
According to a systematic review of recent biomedical text summarization techniques, researchers have employed variations of the sentence extraction methods described above. The most notable differences are the widespread preference for knowledge-rich methods that leverage domain-specific tools, such as the Unified Medical Language System (UMLS), MetaMap [50], and SemRep [51], as well as the increased adoption of hybrid approaches.[3, 23–25]
Despite substantial work on sentence extraction for biomedical text summarization and question answering, previous studies have focused largely on the primary literature. A common goal has been to generate summaries that resemble article abstracts written by study authors. However, previous research has shown that, for clinical decision-making, clinicians prefer synthesized resources to the primary literature. In addition, clinicians prefer patient-specific, actionable recommendations, as opposed to a general overview.[5–9] Specifically, useful sentences (see Table 1 for definitions and examples) should contain an explicit assertion about a specific treatment and for a specific type of patient or patient population. Algorithms need to consider these attributes in order to extract clinically useful sentences. On the other hand, general sentence extraction approaches aim to identify topic-relevant sentences or representative sentences to produce an overview. Therefore, these approaches are not optimal for clinical decision support.
In the present study, we investigate a hybrid method to extract clinically useful sentences from synthesized evidence resources such as UpToDate, a popular knowledge resource written by clinical experts in various medical specialties. We employ a feature-rich approach based on supervised machine learning techniques and a set of Edmundsonian and semantic NLP features. Similar hybrid approaches have been employed in previous sentence extraction research. Our main contribution consists of identifying a rich set of features that serve as predictors of clinically useful sentences and can be effectively used for the extraction of those sentences for clinical decision support. The following sections describe the NLP tools used to extract these features.
MedTagger, an extension of the cTAKES NLP pipeline, is a modular system of pipelined components that combine rule-based and machine learning techniques to extract semantically viable information from clinical documents. For each sentence in a document, MedTagger extracts a set of concepts from the Unified Medical Language System (UMLS). The process includes OpenNLP tokenization, lexical normalization, and dictionary-based concept extraction according to the Aho-Corasick algorithm using both the UMLS Metathesaurus and MeSH. Overall, the precision and recall for MedTagger on the CLEF 2013 shared task were 80% and 57% respectively for strict evaluation and 94% and 77% respectively for relaxed evaluation. The accuracy of MedTagger on a corpus depends on the coverage of the UMLS Metathesaurus in the specific domain of interest and on the accuracy of the lexical normalization resource, both of which have been extensively used in several text mining systems. While the system has not been intrinsically evaluated for biomedical abstracts, MedTagger has been used in previous studies as a component of literature-mining pipelines.[19, 30]
SemRep is a semantic NLP parser that uses underspecified syntactic analysis and structured domain knowledge from the UMLS to extract semantic predications. Semantic predications are relations that consist of a subject, a predicate, and an object. The subject and object of predications are represented with UMLS concepts. Predicates consist of semantic relations such as IS_A, TREATS, and AFFECTS. For example, from the sentence below:
“Quinidine, procainamide, and disopyramide are recommended for patients with atrial fibrillation”
SemRep extracts the following predications:
Atrial Fibrillation PROCESS_OF Patients
Disopyramide TREATS Atrial Fibrillation
Procainamide TREATS Atrial Fibrillation
Quinidine TREATS Atrial Fibrillation
In an exploratory study, we concluded that the number of semantic predications in a sentence is correlated with clinically useful sentences. A subset of SemRep predicates is more relevant to sentences that describe disease treatment, such as TREATS and the comparative predicates COMPARED_WITH, HIGHER_THAN, LOWER_THAN, and SAME_AS. Comparative predicates are extracted in sentences that contrast two treatment alternatives. For example, from the sentence below:
“Etanercept and adalimumab might be safer than infliximab.”
SemRep extracts the following comparative predications:
Adalimumab HIGHER_THAN infliximab
Etanercept HIGHER_THAN infliximab
In a previous study, SemRep's precision and recall for treatment-related predications (i.e., TREATS and comparative predications) were 78% and 50% respectively.
The study method consisted of: 1) extension of a gold standard of labeled sentences from UpToDate; 2) extraction of semantic and text-based features for sentence classification; 3) development of classification models for identifying clinically useful sentences; and 4) testing of a set of hypotheses regarding the performance of these classification models.
We extended a gold standard developed in a previous feasibility study. The resulting gold standard consisted of all 4,824 sentences from 18 UpToDate documents on the treatment of six chronic conditions: coronary artery disease, hypertension, depression, heart failure, diabetes mellitus, and prostate cancer. These documents were selected through a manual search using UpToDate’s search engine. From the search results, we selected the documents most frequently accessed by clinicians according to UpToDate’s usage log.
Sentences were rated independently by three raters with a clinical background (two physicians and one dentist) according to a 5-point clinical usefulness scale (Table 1). The rationale supporting this rating approach is based on evidence that clinical decision support tools that provide patient-specific, actionable recommendations are more likely to produce positive outcomes.[10, 34] The rating scale with instructions was developed iteratively in three stages. In the first stage, two raters independently rated sentences from one document, reaching an inter-rater agreement of 0.52 (linear weighted kappa). After reconciling disagreements, the instructions were refined and another set of sentences was rated, reaching an inter-rater agreement of 0.74. A third rater was included for the remainder of the documents after further refinement of the rating instructions. The final inter-rater agreement was 0.82 (linear weighted kappa). To test algorithm generalizability, a similar rating process was followed to produce a gold standard of all 1,072 sentences from a random set of 140 recent PubMed abstracts that reported results of randomized clinical trials.
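As an illustration of the agreement statistic reported above, the following is a minimal sketch of a linear weighted kappa for two raters on the 5-point scale. It assumes pairwise comparison of two raters; how agreement among three raters was aggregated is not described here, and the function name is hypothetical.

```python
from collections import Counter


def linear_weighted_kappa(ratings_a, ratings_b, categories=(1, 2, 3, 4, 5)):
    """Cohen's kappa with linear weights for two raters on an ordinal scale."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty rating lists of equal length")
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    n = len(ratings_a)

    # Observed joint counts and per-rater marginal counts.
    observed = Counter((index[a], index[b]) for a, b in zip(ratings_a, ratings_b))
    marg_a = Counter(index[a] for a in ratings_a)
    marg_b = Counter(index[b] for b in ratings_b)

    obs_disagreement = 0.0
    exp_disagreement = 0.0
    for i in range(k):
        for j in range(k):
            weight = abs(i - j) / (k - 1)  # linear disagreement weight
            obs_disagreement += weight * observed[(i, j)] / n
            exp_disagreement += weight * (marg_a[i] / n) * (marg_b[j] / n)

    if exp_disagreement == 0:
        return 1.0  # both raters used one identical category throughout
    return 1.0 - obs_disagreement / exp_disagreement
```

With linear weights, a disagreement between ratings 4 and 5 is penalized far less than one between 1 and 5, which suits an ordinal usefulness scale.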
To extract clinically useful treatment sentences we built a classification model based on a set of features extracted from the document’s sentences. These features were selected based on domain knowledge derived from experimental research.[10, 34, 35] More specifically, we followed three underlying principles to guide the selection of potential features: (i) clinically useful sentences should have one or more actionable treatment recommendations; (ii) candidate sentences should define the types of patients (i.e., population) that qualify for a particular recommendation; and (iii) recommendation statements should be assertive, with constructs such as deontic terms (e.g., “we recommend”, “we suggest”), and should include attribution to an evidence source (e.g., “according to the ACC guideline…”).
To capture these attributes, we extracted a combination of semantic and text-based features (cue words). Semantic features included UMLS concepts, UMLS semantic groups, semantic predications, and patient population. Text-based features were extracted directly from the UpToDate dataset using text-based feature selection techniques.
Since the focus of the present study was to extract treatment recommendations, we narrowed MedTagger’s output to treatment-related concepts. Each extracted UMLS concept was mapped to one of four UMLS semantic groups: Chemicals & Drugs (CHEM), procedures (PROC), physiology (PHYS), and disorders (DISO). Moreover, in order to avoid recommendations that are too general and therefore not clinically useful, concepts with the following general semantic types were removed from our dataset: “Body Part, Organ, or Organ Component”, “Neuroreactive Substance or Biogenic Amine”, “Nucleic Acid, Nucleoside, or Nucleotide”, “Amino Acid, Peptide, or Protein”, and “Functional Concept”. These semantic types were selected from a subset of the root semantic types based on domain knowledge. We derived five features from the concepts extracted from each sentence: total number of concepts in the sentence (one feature) and number of concept instances per UMLS semantic group (four features). For instance, the following sentence contains two concepts in the semantic group procedures and one concept in the semantic group disorders:
“For most patients with cardiovascular disease (DISO), we do not recommend anticoagulant therapy (PROC) if they are taking recommended antiplatelet therapy (PROC).”
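The five concept-based features can be sketched as follows. This is a minimal illustration that assumes concept extraction and semantic-group mapping have already been done upstream (for example by MedTagger); the function and feature names are hypothetical.

```python
from collections import Counter

# The four treatment-related UMLS semantic groups named in the text.
SEMANTIC_GROUPS = ("CHEM", "PROC", "PHYS", "DISO")


def concept_features(concept_groups):
    """Build the five concept features for one sentence.

    concept_groups: list with one semantic-group label per extracted concept.
    Labels outside the four groups are ignored.
    """
    counts = Counter(g for g in concept_groups if g in SEMANTIC_GROUPS)
    features = {"n_concepts": sum(counts.values())}
    for group in SEMANTIC_GROUPS:
        features["n_" + group] = counts[group]
    return features


# The example sentence above: one DISO concept and two PROC concepts.
example = concept_features(["DISO", "PROC", "PROC"])
```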
To extract semantic predications, we processed UpToDate documents available in the gold standard with SemRep. We derived seven features from semantic predications: total number of predications with a treatment-related predicate (one feature) and number of predication instances per treatment-related predicate, including negated predicates when applicable (six features): TREATS/NEG_TREATS, ADMINISTERED_TO/NEG_ADMINISTERED_TO, AFFECTS/NEG_AFFECTS, PROCESS_OF/NEG_PROCESS_OF, PREVENTS/NEG_PREVENTS, and COMPARED_WITH/HIGHER_THAN/LOWER_THAN/SAME_AS. For instance, from the sentence below:
“We suggest that methotrexate (MTX) be used as the initial DMARD for patients with moderately to severely active RA, rather than another single nonbiologic or biologic DMARD or combination therapy.”
SemRep produces the following output:
Rheumatoid Arthritis PROCESS_OF patients
Antirheumatic Drugs, Disease-Modifying (DMARD) TREATS Rheumatoid Arthritis
Combined Modality Therapy TREATS Rheumatoid Arthritis
which yields the following features:
Total number of predications: 3
PROCESS_OF instances: 1
TREATS instances: 2
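The mapping from predications to the seven features can be sketched as below. The sketch assumes predications are already available as (subject, predicate, object) triples; it does not call SemRep, and the names used are hypothetical.

```python
# Predicates counted individually; negated forms fold into the same family.
BASE_PREDICATES = ("TREATS", "ADMINISTERED_TO", "AFFECTS", "PROCESS_OF", "PREVENTS")
# Comparative predicates are pooled into a single feature.
COMPARATIVE = {"COMPARED_WITH", "HIGHER_THAN", "LOWER_THAN", "SAME_AS"}
FAMILIES = BASE_PREDICATES + ("COMPARATIVE",)


def predicate_family(predicate):
    """Map a predicate to its feature family, or None if not treatment-related."""
    base = predicate[4:] if predicate.startswith("NEG_") else predicate
    if base in COMPARATIVE:
        return "COMPARATIVE"
    return base if base in BASE_PREDICATES else None


def predication_features(predications):
    """Count treatment-related predications in total and per family."""
    features = {"n_predications": 0}
    features.update({"n_" + family: 0 for family in FAMILIES})
    for _subject, predicate, _object in predications:
        family = predicate_family(predicate)
        if family is None:
            continue
        features["n_predications"] += 1
        features["n_" + family] += 1
    return features


# The methotrexate example above.
example = predication_features([
    ("Rheumatoid Arthritis", "PROCESS_OF", "patients"),
    ("Antirheumatic Drugs, Disease-Modifying", "TREATS", "Rheumatoid Arthritis"),
    ("Combined Modality Therapy", "TREATS", "Rheumatoid Arthritis"),
])
```

For the example, this yields a total of 3, with 1 PROCESS_OF and 2 TREATS instances, matching the counts listed above.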
Patient population determines whether a sentence includes a description of the types of patients who are eligible to receive a certain treatment. For instance:
“In patients with inadequate glycemic control on sulfonylureas, with A1C >8.5 percent, we suggest switching to insulin.”
“For patients with adrenergically-mediated AF, we suggest beta blockers as first-line therapy, followed by sotalol and amiodarone.”
To identify the population of interest, we developed a pattern-based method that returns a population phrase. The method uses two NLP tools, the Stanford lexical parser and Tregex; 130 population-related concepts obtained from the patient or disabled group UMLS semantic type; and 22 terms that were manually identified in UpToDate sentences not included in the gold standard. First, sentences with population-related concepts or terms are filtered. Second, each filtered sentence is processed with the Stanford lexical parser to generate a constituent tree of verb and noun phrases. Third, the node labels are queried using Tregex, a query language for matching patterns in parse trees. Finally, the algorithm extracts the population phrase identified. Tregex patterns resemble regular expressions, but they match over tree structure (node labels and the relations between nodes) rather than over character strings (Table 2). A binary feature was produced to indicate whether a sentence includes a population or not.
A preliminary analysis of the algorithm with a gold standard of 1,825 sentences from UpToDate yielded a precision and recall of 91% and 97% in identifying sentences that mentioned a patient population (unpublished data). A formal experiment to assess the performance of this approach is underway. Given this high performance and the simplicity of the algorithm (e.g., it does not depend on availability of training data), we opted not to experiment with other alternatives for population identification, such as supervised learning techniques.
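The filter-then-extract idea can be illustrated with a much simpler stand-in. The sketch below uses a plain regular expression over the sentence text instead of a constituent parse and Tregex patterns, so it is not the method described above. The term list is a short hypothetical sample, not the 130 concepts and 22 terms used in the study.

```python
import re

# Hypothetical, abbreviated list of population terms.
POPULATION_TERMS = ("patients", "adults", "women", "men", "children")

# A preposition, an optional modifier word, a population term, and an
# optional qualifying clause that runs up to the next punctuation mark.
_POPULATION_RE = re.compile(
    r"\b(?:in|for|among)\s+(?:\w+\s+)?(?:" + "|".join(POPULATION_TERMS) + r")\b"
    r"(?:\s+(?:with|who|whom|in whom)\b[^,.;]*)?",
    re.IGNORECASE,
)


def population_phrase(sentence):
    """Return the first population phrase found, or None."""
    match = _POPULATION_RE.search(sentence)
    return match.group(0).strip() if match else None


def has_population(sentence):
    """Binary feature: 1 if the sentence mentions a patient population."""
    return int(population_phrase(sentence) is not None)
```

A surface pattern like this misses populations expressed through other syntactic constructions, which is one reason a parse-tree query is the more robust choice.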
Text-based features consisted of a set of potentially useful cue terms, such as deontic terms (e.g., suggest, recommend). To identify these terms we selected from the gold standard a training set of three random documents, which were excluded from later experiments. All bigram terms from these documents were extracted and the top 15 terms with the highest Pearson correlation values were selected. This approach is aligned with the method proposed by Hall et al. Other feature weighting methods such as information gain and Gini index returned the same 15 terms. These terms were manually inspected and grouped into four term categories based on domain knowledge. From these categories, we derived four features, each consisting of the number of cue terms from one category in a sentence (Table 3): (i) references to other documents, such as “is discussed elsewhere”; (ii) terms related to study design, such as random and placebo; (iii) terms used in deontic modality, which overlap with terms identified by Lomotan et al. (e.g., recommend, suggest), and terms that indicate evidence sources (e.g., guideline); and (iv) terms that denote “treatment” (e.g., therapy). The first and second categories correlated with sentences that are not clinically useful, while terms in the third and fourth groups correlated with clinically useful sentences.
From the approach above, 17 features were selected for sentence classification (Table 4). The distribution of sentences in the gold standard according to each feature category is available in Table s1 of the online supplement.
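Correlation-based selection of cue bigrams can be sketched as follows. The sketch assumes a simple lowercase alphabetic tokenizer, binary presence of each bigram per sentence, and ranking by the absolute value of the correlation (since two of the four categories correlate negatively with usefulness). These choices are assumptions, and the function names are hypothetical.

```python
import math
import re


def bigrams(sentence):
    """Set of word bigrams in a sentence, using a simple alphabetic tokenizer."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return {" ".join(pair) for pair in zip(tokens, tokens[1:])}


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either variable is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def top_cue_bigrams(sentences, labels, k=15):
    """Rank bigrams by |r| between bigram presence (0/1) and the binary label."""
    per_sentence = [bigrams(s) for s in sentences]
    vocabulary = set().union(*per_sentence) if per_sentence else set()
    scored = []
    for term in vocabulary:
        presence = [int(term in found) for found in per_sentence]
        scored.append((pearson(presence, labels), term))
    # Strongest correlation first; ties broken alphabetically for determinism.
    scored.sort(key=lambda item: (-abs(item[0]), item[1]))
    return scored[:k]
```

The sign of each returned correlation indicates whether a term is associated with useful or non-useful sentences, which is the information used to group terms into categories.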
To select an optimal classifier, we evaluated six different classification algorithms: Kernel-based Bayesian Network, Naïve Bayes, Neural Network, Support Vector Machine (LibSVM), K-Nearest Neighbor, and Logistic Regression. Algorithms were evaluated with the following parameter settings: kernel type, estimation mode, and number of kernels were varied for the Kernel-based Bayesian Network; number of hidden layers, number of nodes in each layer, learning rate, and momentum were varied for the Neural Network; and Kernel type along with the corresponding parameters of each kernel type were varied for the Support Vector Machine and Logistic Regression. The value of k and the weighted voting approach were changed for the K-Nearest Neighbor algorithm. Since our gold standard is unbalanced (87% negative vs. 13% positive cases), probability threshold adjusting was applied to all algorithms. The same three documents used for selecting text-based features were used for finding the best parameter setting for each classifier.
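Probability threshold adjustment for an unbalanced dataset can be sketched as below. The criterion used to choose the threshold is not stated above; this sketch assumes the cut-off is chosen to maximize F-measure on a tuning set, and the function names are hypothetical.

```python
def f_measure_at(threshold, probabilities, labels):
    """F-measure when sentences with P(positive) >= threshold are labeled positive."""
    tp = fp = fn = 0
    for prob, label in zip(probabilities, labels):
        predicted = int(prob >= threshold)
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 1:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(probabilities, labels):
    """Try each observed probability as a cut-off; keep the one with the best F."""
    best_t, best_f = 0.5, -1.0
    for candidate in sorted(set(probabilities)):
        score = f_measure_at(candidate, probabilities, labels)
        if score > best_f:
            best_t, best_f = candidate, score
    return best_t, best_f
```

With a rare positive class, predicted probabilities tend to stay below the default cut-off of 0.5, so lowering the threshold trades some precision for recall.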
A Kernel-based Bayesian Network classifier with 50 Gaussian kernel density greedy estimators performed best and was used in subsequent experiments. This classifier is a Bayesian Network that estimates the true density of the continuous variables using kernels. A kernel is a weighting function, which is generally used in non-parametric estimation techniques. It is employed in kernel density estimation to estimate the density function of random variables. More details about this algorithm can be found elsewhere. As shown in previous research, the Kernel-based Bayesian Network is robust to highly imbalanced datasets [44, 45] (i.e., the number of positive cases is much smaller than the number of negative cases), such as the one in the present study.
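The role of the kernel can be illustrated with a one-dimensional Gaussian kernel density estimate plugged into Bayes' rule. This is a simplified sketch and not the classifier used in the study: it assumes a single continuous feature, one kernel per training sample, and a fixed bandwidth chosen by hand.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a density function built from one Gaussian kernel per sample."""
    if not samples or bandwidth <= 0:
        raise ValueError("need at least one sample and a positive bandwidth")
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


def posterior_positive(x, pos_samples, neg_samples, bandwidth=1.0):
    """P(positive | x) from class priors and kernel-estimated likelihoods."""
    n_pos, n_neg = len(pos_samples), len(neg_samples)
    prior_pos = n_pos / (n_pos + n_neg)
    joint_pos = gaussian_kde(pos_samples, bandwidth)(x) * prior_pos
    joint_neg = gaussian_kde(neg_samples, bandwidth)(x) * (1.0 - prior_pos)
    total = joint_pos + joint_neg
    return joint_pos / total if total > 0 else prior_pos
```

Because the density is estimated from the samples themselves, no parametric form (such as a single normal distribution) has to be assumed for count-valued features like the number of predications.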
We conducted four experiments to test the following hypotheses:
In each of the experiments above, we used the same gold standard excluding the 3 documents that were used to extract text-based features. Ordinal ratings were converted into binominal values: sentences rated as 4 and 5 were considered as the positive class (i.e., clinically useful sentences) and the remaining sentences were considered as the negative class. As a result, 13% of the sentences in the gold standard were labeled as positive versus 87% as negative.
All experiments employed a leave-one-out strategy with 15 iterations and were implemented using RapidMiner (www.rapidminer.com). In each iteration, 14 documents were used for classifier training and one was left out for testing classification performance. To test Hypothesis #4 on Medline sentences, we employed a 20-fold cross-validation strategy with each fold containing 7 abstracts.
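The document-level leave-one-out scheme can be sketched as follows. The sketch assumes each sentence carries the identifier of its source document, so that all sentences from the held-out document are kept out of training; the function name is hypothetical.

```python
def leave_one_document_out(sentence_doc_ids):
    """Yield (held_out_doc, train_indices, test_indices) for each document.

    sentence_doc_ids: list giving the source document id of each sentence.
    """
    for held_out in sorted(set(sentence_doc_ids)):
        train = [i for i, doc in enumerate(sentence_doc_ids) if doc != held_out]
        test = [i for i, doc in enumerate(sentence_doc_ids) if doc == held_out]
        yield held_out, train, test


# 15 hypothetical documents with two sentences each give 15 iterations.
doc_ids = [doc for doc in range(15) for _ in range(2)]
splits = list(leave_one_document_out(doc_ids))
```

Splitting by document rather than by sentence prevents sentences from the same document from appearing in both the training and the test set.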
Classification performance was measured according to the average precision, recall, and F-measure across 15 iterations. F-measure was defined a priori as the primary outcome for hypothesis testing. For statistical significance, we first applied Friedman’s test to verify differences among multiple classifiers. If significant at an alpha of 0.05, pairwise comparisons were made with the Wilcoxon Signed-Rank test. This statistical approach is aligned with the method recommended by Demsar, which accounts for intra-class correlation in cross-validation experiments.
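Averaging the three metrics across iterations can be sketched as below. The sketch assumes per-iteration counts of true positives, false positives, and false negatives are available, and that each metric is computed per iteration before averaging; the function names are hypothetical.

```python
def precision_recall_f(tp, fp, fn):
    """Precision, recall, and F-measure (harmonic mean) from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = precision + recall
    f_measure = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f_measure


def average_metrics(per_iteration_counts):
    """Average each metric over iterations given (tp, fp, fn) tuples."""
    if not per_iteration_counts:
        raise ValueError("need at least one iteration")
    rows = [precision_recall_f(*counts) for counts in per_iteration_counts]
    n = len(rows)
    return {
        "precision": sum(r[0] for r in rows) / n,
        "recall": sum(r[1] for r in rows) / n,
        "f_measure": sum(r[2] for r in rows) / n,
    }
```

Because F-measure is averaged per iteration, the reported average F need not equal the harmonic mean of the average precision and average recall.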
Descriptive statistics of the sentences and features in the gold standard show that all feature types that were assumed to be predictive of useful sentences were more frequent in useful sentences than in non-useful sentences (Table s1 of the online supplement). Detailed results for all experiments are reported in Tables s2 to s5 of the online supplement. The Bayesian Network algorithm outperformed the other alternatives (Table s6 in the online supplement). Different parameter settings did not significantly change the performance of any of the algorithms.
In this study we investigated an automated method for extracting clinically useful sentences from synthesized online clinical resources such as UpToDate. Such a method is an important component for clinical evidence summarization and question answering systems aimed at assisting clinicians with patient-specific clinical questions and decision-making. Based on a recent systematic review of biomedical text summarization, this is the first study to investigate methods to extract clinically useful sentences from synthesized evidence resources. Also, we conducted an exploratory investigation of the generalizability of the method to the primary literature.
Overall, the feature-rich classifier had an F-measure of 74%, with a recall of 78% and precision of 72%. This precision is much higher than the overall rate of clinically useful sentences in the UpToDate documents in the dataset, which is 13%. Therefore, the feature-rich classifier could be used to enable a more efficient alternative or complementary mechanism for clinicians to peruse clinical evidence from clinical knowledge resources. One advantage is that in addition to classifying sentences, our method generates rich sentence-level metadata, which could be leveraged by interactive text summarization tools and to enable semantic integration with electronic health record (EHR) systems.[47, 48] We are currently designing a context-aware clinical knowledge summarization tool that employs the feature-rich sentence classifier algorithm along with a sentence ranking algorithm based on a clinician’s information needs. The tool is designed to integrate with EHR systems via OpenInfobutton, an open source Web service compliant with the Health Level Seven (HL7) Infobutton Standard. A description of the tool, along with results of a formative evaluation, is available elsewhere.[50, 51]
We conducted four experiments to test four hypotheses. The first experiment showed that the domain-specific method, which is based on semantic and syntactic features of sentences, significantly outperformed the feature co-occurrence method. This is an expected outcome, since the domain-specific method looks for specific characteristics that contribute to the clinical usefulness of sentences. Specifically, the features used in the feature-rich classifier were fine-tuned based on domain knowledge. This finding also highlights the importance of state-of-the-art, biomedical semantic understanding methods and tools, such as MetaMap and SemRep, in the context of biomedical text summarization.
The second and third experiments showed that each feature type provides additional contribution to the overall classification performance. This may be explained by the different strengths and weaknesses of each feature type. For example, semantic predications and concepts identify relations between treatment interventions and conditions; the population algorithm identifies the definition of a specific patient population; and cue words identify patterns associated with useful sentences (e.g., deontic terms, evidence attributions) and non-useful sentences (e.g., study design terms). Moreover, the results showed that predication and concept-based features improved recall, while the population-based feature improved precision. Although individual text-based features performed significantly worse than other feature types, text-based features improved overall precision because cue words, such as deontic terms, could not be detected by the other feature types used in our study. The fourth experiment suggests that the classifier’s model and features used for UpToDate are generalizable to Medline, possibly because the structure and semantics of clinically useful sentences are similar among different online clinical resources. We recently completed a more thorough analysis that confirms these exploratory findings.
Error analysis identified two main categories of misclassification errors. False-positive cases were caused by a wide range of problems. We describe three categories that occurred most frequently. The first one (13% of the cases) consisted of recommendations that were too general. For example, in “All patients with CVD should have measurement of waist circumference and calculation of body mass index,” both the population (“all patients”) and the intervention (“waist circumference and body mass index”) are too general. Our algorithm includes a mechanism to exclude general concepts, but a more sophisticated approach is needed to further improve performance. The second category (16% of the cases) consisted of sentences that present details about the implementation of a certain treatment, such as “In comparison, amiodarone is metabolized in the liver and dose adjustment is probably necessary in patients with hepatic dysfunction.” While these sentences may be useful once a clinician identifies a recommendation that applies to her patient, they present details that would not be important in the first tier of a text summary. The third category (6% of the cases) consisted of sentences that contained all the desired features, except that they lacked a specific patient population, such as “Although some have suggested that combination antiarrhythmic drug therapy may be an alternative, there are limited data to support such an approach and the patient may be exposed to a greater risk of proarrhythmia and other side effects.”
False-negative cases were due to two main categories. The first one (56% of the cases) consisted of useful sentences from which SemRep and MedTagger failed to extract useful treatment predications and concepts, such as “non-pharmacologic therapies that may benefit patients with angina refractory to the above medical therapies include enhanced external counterpulsation, spinal cord stimulation, and transmyocardial revascularization.” Improvements in the coverage of the underlying controlled terminologies used by SemRep and MedTagger would address most of these problems. The second category (9% of the cases) consisted of deontic and population expressions that our algorithm did not account for (e.g., “we consider,” “we use”), such as in “we use calcium channel blockers and nitrates routinely to relieve symptoms when initial treatment with beta blockers is not successful or if beta blockers are contraindicated or cause side effects.”
Our study had a few limitations. First, our approach was tuned to extract treatment recommendations and most likely needs to be adapted to extract diagnostic recommendations. For example, a different set of predicates and semantic groups would be necessary. Second, the experiments were conducted on a relatively small subset of UpToDate documents. Yet, the sample size had enough statistical power to detect differences among the various approaches tested. In addition, documents were chosen based on a representative sample of common and complex chronic conditions that affect a large patient population. Therefore, it is expected that our experimental results will generalize to other documents on the treatment of similar conditions.
We investigated domain-specific supervised machine learning methods using a rich set of semantic and syntactic features to classify clinically useful sentences in UpToDate articles. The feature-rich approach significantly outperformed classifiers based on a single type of feature. Different types of semantic features provided a unique contribution to overall classification performance. The Kernel-based Bayesian Network method outperformed other machine learning algorithms. In future studies, the resulting sentence classifier can be used as a component in text summarization and question answering systems to support clinicians’ decision-making.
This project was supported by grants 1R01LM011416-01 and 4R00LM011389-02 from the National Library of Medicine.