This paper reviews work over the past two years in Natural Language Processing (NLP) applied to clinical and consumer-generated texts.
We included any application or methodological publication that leverages text to facilitate healthcare and address the health-related needs of consumers and populations.
Many important developments in clinical text processing, both foundational and task-oriented, were addressed in community-wide evaluations and discussed in corresponding special issues that are referenced in this review. These focused issues, together with in-depth reviews of several other active research areas such as pharmacovigilance and summarization, allowed us to discuss in greater depth disease modeling and predictive analytics using clinical texts; text analysis in social media for healthcare quality assessment; trends towards online interventions based on rapid analysis of health-related posts; and consumer health question answering, among other topics.
Our analysis shows that although clinical NLP continues to advance towards practical applications and more NLP methods are used in large-scale live health information applications, more needs to be done to make NLP use in clinical applications a routine, widespread reality. Progress in clinical NLP is mirrored by developments in social media text analysis: the research is moving from capturing trends to addressing individual health-related posts, thus showing potential to become a tool for precision medicine and a valuable addition to standard healthcare quality evaluation tools.
The promise of a wider use of clinical Natural Language Processing (NLP) in healthcare information technology (IT) has been at our fingertips for decades, with several successful applications integrated in daily care, such as MedLEE . Yet clinical NLP, i.e., natural language processing methods developed and applied to support healthcare by operationalizing clinical information contained in clinical narrative, remains an emerging technology. The gap between promise and reality is becoming noticeable to the potential beneficiaries of clinical NLP: a recent editorial in Circulation: Cardiovascular Quality and Outcomes notes that most of clinical NLP’s current successes are restricted to research settings . The authors then state: “NLP tools do not perform well enough for focused clinical tasks like real-time surveillance, quality profiling, and quality improvement initiatives, and the focused NLP tools tend to lose performance in clinical environments outside of their development frame. […] As a result, NLP use in clinical operations has been limited.” As a community, NLP researchers need to take urgent steps to remedy the situation; the work discussed in this review gives some hope towards this goal.
This review includes any application or methodological publication that leverages texts to facilitate healthcare and addresses the needs of consumers and populations. We focus on the areas where we see much active contribution of NLP techniques in the past two years, with a few exceptions for somewhat older papers. Activity in these areas is facilitated by a dramatically growing availability of texts to researchers and a changing culture that promotes sharing tools and resources.
We omit discussing basic research recently reviewed by Névéol and Zweigenbaum , which is nonetheless still needed and ongoing. Some examples include exciting new approaches proposed in the context of community challenges: the 2012 i2b2 event and time extraction task , the 2014 i2b2/UTHealth modeling of risk factors for heart disease , the ShARe SemEval 2014 recognition and normalization of disorders [6, 7], and the ShARe SemEval 2015 disorder and template filling shared tasks . We refer the reader to the overview papers for each task to learn more about the different proposed approaches. Similarly, we do not discuss NLP methods developed within the Text Retrieval Conference (TREC) Medical Records and Clinical Decision Support tracks, which focused, respectively, on finding patient cohorts using EHR notes and eligibility criteria, and on finding relevant publications and published case studies given a description of a patient’s case. The overviews and the detailed descriptions of individual efforts are available in TREC publications [9, 10]. We also leave the discussion of international developments to the MEDINFO 2015 panel “Current perspectives on NLP for electronic medical records” , recent editions of the CLEF eHealth challenges, and the AMIA 2014 panel “Clinical Natural Language Processing in Languages Other Than English” .
Since Névéol and Zweigenbaum provide information about methods , we only note here that the methods in the included papers range from regular expressions, which dominate research in social-media text processing, to event extraction in a supervised setting. More recently, there has been steady progress in incorporating the principles of distributional semantics into NLP pipelines, and a shift towards more semantic parsing; however, more work in these areas is needed.
In the rest of the paper, we discuss in greater depth disease modeling using clinical texts (section 2), patient cohort selection (section 3), secondary uses of clinical data, i.e., uses outside of direct health care delivery  (section 4), support for hospital operations (section 5), and support for individuals and populations that often relies on text analysis in social media (section 6). We conclude the review with a critical reflection on the success and constraints of using NLP methods in clinical settings and propose an outlook for the future.
Representing disease in a computable form has been one of the long-standing research activities in the field of biomedical informatics. Here, we consider computational modeling and phenotyping from the standpoint of representing a disease, either through hard-coded or probabilistic rules over clinical observations in the patient record. In recent years, phenotyping efforts have considered the electronic health record (EHR) as one of the primary sources [14,15].
Disease representation approaches that leverage information conveyed in clinical notes and reports have focused mostly on modeling one disease at a time. In this setup, the goal is to classify a given patient record as positive or negative for the presence of a specific disease. The task is often cast as a classification problem, and the set of discriminative features represents a model of the disease, which can be examined and interpreted by experts. Not surprisingly, the challenge lies in selecting the features that yield the most accurate representation of the disease.
Most methods for identifying features from patient notes use the approach we outline next (see  for a review and more recently [17–20]). Experts provide a list of phenotype-related terms that will constitute the basis of the disease model. To account for lexical variations in the expert-provided terms, the terms are mapped to standard terminologies (e.g., terms related to signs and symptoms are mapped to SNOMED-CT, and medication names are mapped to RxNorm). The augmented, custom dictionary is then used as part of a standard NLP pipeline that extracts term mentions and modifiers such as negation and uncertainty. In this scenario, the text surrounding the mentions of the controlled vocabulary terms is ignored by the disease-modeling task. The disease model is thus heavily dependent on the expert-provided initial phenotype description.
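The dictionary-based pipeline outlined above can be illustrated with a minimal sketch. The term list, the hypothetical SNOMED-CT codes, and the negation triggers below are all invented for the example; a real pipeline would use full vocabularies (e.g., SNOMED-CT, RxNorm) and a dedicated negation module such as NegEx.

```python
import re

# Illustrative expert-provided phenotype terms mapped to (hypothetical)
# standard terminology codes; real systems use SNOMED-CT and RxNorm.
TERM_DICT = {
    "chest pain": "SNOMED:29857009",
    "shortness of breath": "SNOMED:267036007",
}

# Simple negation triggers in the spirit of NegEx-style rules.
NEGATION_TRIGGERS = ("no ", "denies ", "without ")

def extract_mentions(note: str):
    """Return (term, code, negated) tuples for dictionary terms in the note."""
    mentions = []
    lowered = note.lower()
    for term, code in TERM_DICT.items():
        for match in re.finditer(re.escape(term), lowered):
            # Look a short window to the left for a negation trigger.
            window = lowered[max(0, match.start() - 20):match.start()]
            negated = any(trigger in window for trigger in NEGATION_TRIGGERS)
            mentions.append((term, code, negated))
    return mentions

mentions = extract_mentions("Patient denies chest pain but reports shortness of breath.")
```

In this design, only the dictionary terms and their immediate modifiers are captured; as noted above, the surrounding text is ignored, which is exactly why the disease model stays tied to the expert-provided phenotype description.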
To ease the reliance on the curated list from domain experts, Yu and colleagues propose to identify features in the following fashion : to identify the terms of interest for a target disease, they look for candidate terms in publicly available documents that are known to discuss the disease (e.g., Wikipedia, Merck manual). These candidate terms are then screened further by examining their distribution over notes in the EHR. When tested on the modeling of two diseases (rheumatoid arthritis and coronary artery disease), the phenotypes with automatically generated features were more accurate at identifying patients with the disease than the ones using expert-curated features.
Luo and colleagues  propose a different approach to curated features. In their phenotyping scenario, input data come from clinical reports. Sentences are translated into a graph representation, using Unified Medical Language System (UMLS) for concept/node mapping. The sentence subgraphs are then mined to identify frequent and duplicate graphs. Luo et al. tested their approach on modeling three types of lymphomas from pathology reports. They showed that using sentence subgraphs as features yields better disease classification than using words alone (n-grams) or UMLS terms. More importantly, for the sake of disease modeling, they confirmed that the semantic modeling of the report content yields better phenotyping performance.
Although semi- and distantly supervised methods still require an annotated gold standard, they are less reliant on manual annotations. In order to reduce this complex and time-consuming step, Halpern and colleagues introduced the concept of anchor variables, defined as intermediate key observations about the patient that might in turn be relevant for phenotyping (e.g., an anchor might be defined as “has cardiac etiology”) . They learned the anchors in an automated fashion from unlabeled data and ontological knowledge. They also provided an interface to elicit from domain experts the relevance of the anchors for a specific phenotyping task.
Recently, methods to identify sub-groups of a given disease have been investigated. We note, however, that most approaches to date do not yet rely on the notes and reports (e.g., ). Approaches to model a large number of diseases at once have also been proposed in the literature. There, the point is not precise disease modeling, but rather to build a map of patient characteristics across diseases. With such multiple disease models, patients can be characterized according to all of their conditions rather than a single condition, and diseases can be studied in the context of each other. As in the subgroup identification work, most approaches currently examine structured data and ignore patient notes [25, 26]. One exception is the UPhenome model , which leverages notes. It uses a probabilistic graphical model to jointly learn a representation over the words in the notes, the structured part of the patient record, and a very large set of latent variables representing the phenotypes to be learned. When evaluated on 50 random phenotypes of the 750 learned, clinical experts found the learned disease models to be coherent and representative of the corresponding diseases.
Both the prospective clinical studies that need to find and enroll eligible patients and the retrospective studies that are increasingly relying on secondary use of EHR data turn to clinical notes to extract some of the patients’ characteristics. Systems that identify cohorts eligible for a clinical trial are described in the 2010 review of Weng and colleagues . Several studies have since included NLP in addressing the important problem of identifying eligible patients across an institution’s EHR. This section does not discuss a large body of work in two related areas: the analysis of eligibility criteria in the description of clinical studies, e.g., in ClinicalTrials.gov (see  for an example), and the approaches to standardize formal representations of the inclusion and exclusion criteria .
Extraction of cohort characteristics from clinical texts ranges from the relatively simple task of identifying a single eligibility criterion  to the identification of several criteria within or across trials [31, 32]. Methods to augment the extracted information and represent the patient in a more sophisticated fashion have also been proposed .
When comparing eight automated approaches to extracting patients’ metastatic status from unstructured radiology reports, Petkov and colleagues  found that the best performing algorithm consisted of a sequence of rules encoding positive and negative relations among metastatic terms and a set of “ignore phrases” (e.g., ‘‘evaluate for metastasis”). This approach resulted in sensitivity 94%, specificity 100%, positive predictive value 90%, negative predictive value 100%, and accuracy of 99% on a set of 5,523 patients with 10,492 radiology reports.
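A rule sequence of this kind can be sketched as follows. The patterns here are illustrative stand-ins, not the actual rules of the cited study: ignore phrases are masked first, then negated mentions, and any remaining mention counts as positive.

```python
import re

# Ordered rule stages in the spirit of the best-performing algorithm.
# All patterns are invented for illustration.
IGNORE_PHRASES = [r"evaluate for metasta\w+", r"rule out metasta\w+"]
NEGATIVE_PATTERNS = [r"no (evidence of )?metasta\w+", r"without metasta\w+"]
POSITIVE_PATTERNS = [r"metasta\w+"]

def metastatic_status(report: str) -> bool:
    text = report.lower()
    for phrase in IGNORE_PHRASES:          # mask hedged "ignore phrases" first
        text = re.sub(phrase, " ", text)
    for pattern in NEGATIVE_PATTERNS:      # then mask negated mentions
        text = re.sub(pattern, " ", text)
    # any metastatic term that survives the masking counts as positive
    return any(re.search(p, text) for p in POSITIVE_PATTERNS)
```

The ordering matters: applying the positive patterns before masking would mislabel both the hedged indication and the negated finding as positive.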
Kreuzthaler and colleagues assessed the necessity and accuracy of information extraction (IE) from clinical notes for a cohort study on metabolic syndrome . For this study, about 50% of the needed attributes were in semi-structured document templates. Using the Apache UIMA framework and regular expressions for information extraction, the authors achieved a 0.90 F-score, identifying typing errors, inconsistency, redundancy, and spelling variants in the notes as the main challenges for information extraction. In their outlook for NLP in cohort identification, the authors were fairly optimistic about the effort needed to adapt IE frameworks to specific information needs, provided the spelling variants and errors could be normalized to a standard vocabulary.
Ni and colleagues assessed the effectiveness of information extraction methods in automated eligibility screening for clinical trials in an urban tertiary care pediatric emergency department (ED) . To that end, the authors collected eligibility criteria for 13 clinical trials, as well as demographics, laboratory data, and clinical notes for 202,795 patients visiting their ED during the trial recruitment period. The eligibility criteria were based on 15 EHR fields, seven of which were derived from text in the clinical notes. The structured fields were used to develop logical constraint filters. These filters were combined with descriptive criteria derived from the notes using information extraction, concept identification, negation detection, and elements of discourse, as well as supervised term expansion based on UMLS hyponymy of the terms identified in the notes of eligible patients. This approach reduced the screening workload from an average of 98 encounters per trial, which clinicians would otherwise have to review to identify all eligible patients in the gold standard set, to eight screened encounters per trial.
Because many eligibility criteria have a temporal aspect, Raghavan and colleagues argued that the matching of a patient’s record to the criteria needs to integrate the timing of events documented in the record . Thus, they proposed a supervised approach to creating a patient timeline. The primary contribution of the work is that it orders the different events of the patient’s record spread across the different parts of the report. They represent the events and their ordering through a finite state transducer, which enables a search for the best ordering. Of note, while the goal of this work is matching patients with eligibility criteria, the creation of such a timeline is a promising avenue for NLP techniques.
Methods for matching patients with eligibility criteria that go beyond matching rules have also been proposed in recent years. Li et al.  explored methods for linking medications and their attributes in two corpora, 3,000 clinical trial announcements and 1,655 clinical notes, which represent the types of texts that will need to be linked through a common criteria representation. Li et al. compared binary classification of links to CRF-based multi-layered sequence labeling, in which each layer deals with one type of label. Both methods had comparable performance and achieved F-measures in the 80s on the two collections.
Shivade and colleagues annotated sentences corresponding to one of four potential eligibility criteria related to cardiac problems in 80 of the records provided in the 2014 i2b2 challenge on identifying risk of heart disease . They piloted an approach inspired by the Recognizing Textual Entailment (RTE) task  to decide whether a criterion can be inferred from sentences in the patient’s record. Of the four relatively simple RTE methods tested, semantic methods outperformed lexical methods; however, the results were low for all tested methods.
Miotto and Weng derived a target patient representation for each of 13 diversified clinical trials . The target representation consisted of four vectors, one of which was based on clinical notes, represented through their distribution over a learned topic model. For a given trial, the EHR data of a new, unseen patient were matched to the “target patient” using pairwise cosine similarity. Ranked patients with a similarity score above an empirically set threshold were considered eligible. In an evaluation with 262 participants of the 13 trials, half of whom were used for training while the other half were combined with 30,000 randomly selected patients for testing, binary classification of patients as eligible or not achieved 0.95 AUC. This approach indicates that efforts to structure patients’ criteria and match them to EHR data (or to the literature for decision support, as in the CDS track of TREC ) are a promising research direction.
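The matching step can be sketched in a few lines. The 4-dimensional vectors and the threshold below are toy stand-ins for the learned topic distributions and the empirically set cutoff of the cited work.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def eligible_patients(target, patients, threshold):
    """Rank patients by similarity to the 'target patient' vector and keep
    those whose score clears the empirically set threshold."""
    scored = sorted(((cosine(vec, target), pid) for pid, vec in patients.items()),
                    reverse=True)
    return [pid for score, pid in scored if score >= threshold]

# Toy 4-dimensional vectors standing in for learned topic distributions.
target = [0.7, 0.1, 0.1, 0.1]
patients = {"p1": [0.6, 0.2, 0.1, 0.1], "p2": [0.1, 0.1, 0.1, 0.7]}
```

Because cosine similarity is scale-invariant, patients whose notes are much longer than the target patient's are not penalized for sheer volume, which is one reason it is a common choice for this kind of profile matching.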
Besides disease modeling and cohort selection, there are several applications in the realm of secondary use of clinical data that leverage clinical notes. They roughly fall into three types of applications: predictive analytics, pharmacovigilance and drug repurposing, and characterization of population and care patterns.
The task at hand in predictive analytics is to predict an outcome or an event of interest in the future (e.g., whether a patient will be readmitted to the hospital). In recent approaches to predictive analytics that consider the content of clinical notes, text-related features consist of bags of words. For instance, Poulain and colleagues described a retrospective study of suicide risk in U.S. veterans, using n-grams extracted from a small cohort of clinical records to identify potential predictors of suicide .
Goodwin and Harabagiu developed a probabilistic graph-based method to predict the progression of clinical findings for individual patients . Re-using clinical notes annotated for the 2014 i2b2 challenge, they inferred a chronological ordering of the findings (obesity, hypertension, diabetes, hyperlipidemia, and coronary artery disease) and used probabilistic inference on the graphical model to make predictions. Although the computational approach is interesting, it needs to be further explored using currently available NLP methods, rather than gold standard annotations, to build the predictive models.
When predicting 30-day readmission, Caruana and colleagues took all mentions of UMLS terms in the notes with a mapping to the Core Problem List Subset of SNOMED CT, and created predictive models that can scale to the number of patients and features . For instance, their models, trained on 195,000 patients and tested on 100,000 patients, can incorporate about 4,000 features per patient, most of them terms extracted from the notes.

While research in outcome prediction in the intensive care unit (ICU) has predominantly focused on physiological signals, there is emerging work on incorporating clinical documentation for outcome prediction. Ghassemi and colleagues explored modeling of mortality at different time ranges (in-hospital, 30-day post-discharge, and 1-year post-discharge) . Text-derived features consisted of the distribution over a topic model learned across a large corpus of ICU notes. When text features were added to baseline clinical features, such as severity scores and demographics, mortality prediction improved for all time ranges, and the discriminative topics correlated with known causes of death.
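The feature construction in these studies — appending text-derived topic proportions to baseline clinical features — can be sketched as follows. The topic word lists, severity score, and note tokens are all invented for the example; a real system would learn the topics with LDA and feed the combined vector into a fitted classifier.

```python
def topic_proportions(note_tokens, topic_words):
    """Crude stand-in for a learned topic model: the share of note tokens
    matched by each topic's word list (a real system would use LDA)."""
    counts = [sum(tok in words for tok in note_tokens) for words in topic_words]
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]

def mortality_features(severity_score, age, note_tokens, topic_words):
    """Baseline clinical features with text-derived topic features appended."""
    return [severity_score, age] + topic_proportions(note_tokens, topic_words)

# Illustrative topics (word lists invented for the example).
topics = [{"sepsis", "lactate", "pressors"}, {"extubated", "ambulating", "stable"}]
features = mortality_features(14, 67, ["sepsis", "pressors", "stable"], topics)
```

Keeping the text features as a low-dimensional topic distribution, rather than a raw bag of words, is what makes it feasible to add them to an existing severity-score model without overwhelming it.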
To study progression of disease, Perotte and colleagues cast their work in a survival analysis framework . They proposed a method to incorporate longitudinal data and documentation prior to disease onset into the survival model for progression. They used topic models learned from a large corpus of notes on patients with chronic kidney disease. In their experiments on chronic kidney disease progression, they showed that a model that incorporates longitudinal topic models and laboratory test data performs best at predicting which patients are more likely to progress faster. Furthermore, as in the mortality study, they found that the significant topics for progression correlated with known risks of progression.
Leveraging the content of clinical notes to identify potential adverse drug events (ADEs) in a systematic fashion is another active area of research. Wang and colleagues extracted drug and disorder mentions from the clinical notes of 1.6 million patients to create a pool of drug-disorder pairs . The pairs were then used as instances for learning potential ADEs. For the task of detecting drug-drug interactions, Iyer and colleagues started from a similar approach of extracting all drugs and disorders, but included a temporal aspect in their statistical analysis . Rather than relying on global occurrence counts derived from note mentions, Henriksson and colleagues focused on identifying explicit relations between drug and disorder mentions in the clinical notes, both within and across sentences . They experimented with distributional semantics, specifically word2vec, and showed a positive impact on the learning of explicit ADE relations in notes. Of note, social media is emerging as an additional source of ADE evidence, complementary to clinical data (for detailed reviews, see [46–48]).
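The pair-pooling step common to these approaches can be sketched minimally. The drug and disorder lexicons below are illustrative; real systems use full vocabularies (e.g., RxNorm, SNOMED-CT) and dedicated named-entity recognition rather than exact token matching.

```python
# Illustrative lexicons; real pipelines use RxNorm / SNOMED-CT plus NER.
DRUGS = {"warfarin", "amoxicillin"}
DISORDERS = {"bleeding", "rash", "nausea"}

def drug_disorder_pairs(notes):
    """Count (drug, disorder) pairs that co-occur within a sentence."""
    counts = {}
    for note in notes:
        for sentence in note.lower().split("."):
            tokens = set(sentence.split())
            for drug in DRUGS & tokens:
                for disorder in DISORDERS & tokens:
                    pair = (drug, disorder)
                    counts[pair] = counts.get(pair, 0) + 1
    return counts

pairs = drug_disorder_pairs([
    "Started warfarin. Patient reports bleeding while on warfarin.",
    "Amoxicillin prescribed. Developed rash after amoxicillin.",
])
```

Corpus-wide counts like these are the raw input to the downstream statistical screening; restricting co-occurrence to the sentence level, as above, is one simple way to approximate the explicit relations that Henriksson and colleagues target more directly.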
Even for clinical events for which well-defined concepts exist in standard ontologies, there is value in using simple keyword searches to identify patient cohorts relevant to the event. For instance, when identifying dialysis patients, Abhyankar and colleagues showed that combining search over structured codes with simple keyword search of notes identified populations with better overall performance . Researchers have proposed similar methods with good success for a range of tasks, including identifying documentation patterns of Framingham criteria in patients with and without heart failure , determining the prevalence of different indications for colonoscopies , measuring physician adherence to guidelines for medication use and behavioral modification in gout patients , identifying patterns of opioid over-prescription , and tracking the population of congestive heart failure patients in a state-wide health information exchange .
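Combining the two retrieval strategies amounts to taking the union of a structured-code query and a keyword search over the notes. The codes and keywords below are invented for the example, not those of the cited dialysis study.

```python
# Illustrative codes and keywords; not those of the cited study.
DIALYSIS_CODES = {"ICD9:585.6", "CPT:90935"}
DIALYSIS_KEYWORDS = ("dialysis", "hemodialysis")

def dialysis_cohort(patients):
    """patients: {patient_id: {"codes": set, "notes": [str]}} -> cohort ids."""
    cohort = set()
    for pid, record in patients.items():
        has_code = bool(record["codes"] & DIALYSIS_CODES)
        has_keyword = any(kw in note.lower()
                          for note in record["notes"]
                          for kw in DIALYSIS_KEYWORDS)
        if has_code or has_keyword:   # union of the two retrieval strategies
            cohort.add(pid)
    return cohort

cohort = dialysis_cohort({
    "p1": {"codes": {"ICD9:585.6"}, "notes": []},
    "p2": {"codes": set(), "notes": ["Patient tolerated hemodialysis well."]},
    "p3": {"codes": set(), "notes": ["Routine follow-up, no complaints."]},
})
```

Patient "p2" illustrates the point of the combination: the event is documented only in free text, so a codes-only query would miss it.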
The use of distributional semantics was also found to be helpful in characterizing patterns of clinical documentation. Sullivan and colleagues used topic-model representations of clinical notes to help detect potential misdiagnosis of epilepsy syndrome in a pediatric population . McCoy and colleagues, in order to study the relevance of research domain criteria in psychiatric care, mapped clinical notes to the prevalence of documentation according to five domain criteria (negative valence, positive valence, cognitive functioning, arousal, and social processes) . There, a direct keyword search approach makes less sense. For each domain, they identified a set of domain-relevant corpora of Web pages. They then transformed the domain-specific Web pages and clinical notes into a vector space model using latent semantic indexing, and scored clinical notes according to their similarity to each domain-specific vector. McCoy and colleagues showed that their approach not only scores documentation with respect to domain criteria as described above, but also characterizes populations and outcomes, such as length of stay, according to these inferred scores.
Clinical NLP can have a practical impact on administrative as well as point-of-care aspects of hospital operations. Some practical impact can already be seen in such established areas as medical coding and billing. The work in this area continues to grow and is paralleled by research and some advances into practice in quality improvement and clinical decision support.
Efforts to support the needs of hospitals with billing and syndromic surveillance have been reported over the past two years. Perotte and colleagues proposed to leverage the content of discharge summaries to identify billing codes without restriction to a clinical domain, working on a corpus of 26,000 discharge summaries . Their feature set is a simple bag of words from clinical notes, but the classification itself leverages the hierarchical nature of the ICD-9 tree. With the adoption of ICD-10 coding, Subotin and Davis proposed a diagnosis code assignment method that also uses a bag-of-words approach, but combines a series of assignments based in part on the structure of the ICD-10 classification . Their experiments on a corpus of 28,000 patient records show promising results for this new and complex terminology.
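The way such classifiers exploit the code hierarchy can be sketched top-down: a child code is only considered when its parent matched. The tiny tree and the keyword "classifiers" below stand in for the learned per-node models of the cited work and are invented for illustration.

```python
# Toy code tree; keyword sets stand in for learned per-node classifiers.
HIERARCHY = {
    "icd9:250": {"keywords": {"diabetes"}, "children": ["icd9:250.0"]},
    "icd9:250.0": {"keywords": {"uncomplicated"}, "children": []},
    "icd9:428": {"keywords": {"heart", "failure"}, "children": []},
}
ROOTS = ["icd9:250", "icd9:428"]

def assign_codes(summary, nodes=ROOTS):
    """Assign codes top-down: descend into a branch only if its parent fires."""
    tokens = set(summary.lower().split())
    assigned = []
    for code in nodes:
        node = HIERARCHY[code]
        if node["keywords"] & tokens:          # parent must match...
            assigned.append(code)
            assigned += assign_codes(summary, node["children"])  # ...before children
    return assigned

codes = assign_codes("Discharge summary: uncomplicated diabetes mellitus type II")
```

Pruning whole branches at unmatched parents is what keeps the approach tractable over thousands of ICD codes, at the cost of propagating any parent-level misclassification down the tree.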
Towards the goal of syndromic surveillance, Haas and colleagues proposed to classify ED triage notes into one of three high-level categories: gastro-intestinal, respiratory, and fever-rash . Starting from a master list of terms pertinent to each category, they iteratively added terms to the list by searching the triage notes for terms that were similar according to a lexical, context-based metric. More recently, Lopez Pineda and colleagues focused on predicting influenza in the ED. They experimented with data from four different hospitals to predict influenza from clinician-authored reports, rather than triage notes and chief complaints .
In addition to billing and reporting activities, exploratory work on assisting healthcare organizations in improving data quality is ongoing. Yetisgen and colleagues developed statistical and knowledge-based methods that combine publicly available tools in pipelines to support the Surgical Care and Outcomes Assessment Program (SCOAP) . The program aims to improve the quality and compare the effectiveness of surgical procedures across multiple Washington state hospitals. The F-scores for the 25 elements extracted from the notes of 618 patients at one institution varied for both the statistical and rule-based methods, but are encouraging enough to warrant further research.
Raju and colleagues used keyword extraction to compute the adenoma detection rate from colonoscopy reports . This method outperformed manual screening, correctly identifying 91.3% of screening examinations compared to 87.8% identified manually. Similarly, Gawron and colleagues  primarily used regular expressions for adenoma detection, achieving 0.98 and 0.99 accuracy in identifying the screening indication and a complete procedure, respectively. The correct location and histology of the polyp were identified with 0.94 positive predictive value, 0.94 sensitivity, and 0.94 F1 score. The numbers of polyps and adenomas were underestimated by the method.
Although not yet directly applicable to clinical narrative, an approach to formalizing quality measures developed by Dentler and colleagues makes it possible to structure the requirements and issue SQL queries against a database to compute quality measures .
Despite the promise of opportunities for NLP techniques to contribute to clinical decision support (CDS) tools , there are few instances of applications that operate at the point of care and make use of NLP technology. Dean and colleagues reported on a real-time pneumonia screening tool in the ED that provides care recommendations when pneumonia is inferred from the patient’s radiology report [66,67]. While they did not observe a significant difference in mortality between the EDs where the tool was deployed and the control EDs, they found that EDs that had access to and used the tool showed increased adherence to recommendations for pneumonia care. Demner-Fushman and colleagues reported on an in-depth evaluation of their evidence-based decision support tool . The evaluation showed stable use of an application that extracts concepts from patients’ progress notes to automatically generate searches over several resources identified as useful by the NIH Clinical Center interdisciplinary teams. More recently, there has been renewed work on problem-list generation based on patient notes [69, 70] and through patient record summarization  (for a review of patient record summarization techniques that use NLP, see ).
Along with the general explosion of social media use, more and more health consumers discuss their health and their care ecosystem online, whether on general social media sites or in dedicated discussion boards. For instance, in a study of 294,000 Yelp New York City restaurant reviews, Harrison and colleagues found 468 reports consistent with foodborne illness, of which only 3% had been reported through the official channels . In a 2011 literature review, Smith concluded that consumer language is an under-researched area inside and outside of healthcare . We review here recent advances in NLP for health consumer language, as well as exciting avenues for learning from patient-generated texts.
In the past couple of years, there has been some evidence suggesting that the language used in health texts should be adapted to the level of health literacy of health consumers, for them to comprehend the text. A study of FDA Drug Safety Communications revealed that rewriting the existing communications in plain language significantly increased consumers’ level of comprehension . Ramesh and colleagues, linking 20 de-identified progress notes and 20 de-identified discharge summaries to MedlinePlus, UMLS, and Wikipedia, showed that the Wikipedia links significantly improved self-reported EHR note comprehension by Amazon Mechanical Turk (AMT) workers .
There has also been much research recently in building basic NLP tools to support the automated analysis of the language authored by health consumers and patients. Basic approaches developed recently for consumer language include: spelling correction, for which Zhou and colleagues rely on Google Spell Checker , whereas Kilicoglu and colleagues are developing a stand-alone publicly available tool and corpora ; automated evaluation of errors in consumer language processing made by NLP tools ; extraction of patient demographics and personal medical information [80, 81]; Keyword in Context (KWIC) analysis to evaluate patients’ experience with primary care reported in a survey ; and a framework for finding health mentions online .
Recent developments in enriching existing consumer vocabularies include a system developed to assist with collaborative updates to an existing consumer health vocabulary ; a crowdsourcing approach to identifying medical terms in patient-authored texts ; unsupervised generation of a lexicon representative of the sub-language used in an online consumer community ; mining pairs of professional terms and their equivalent consumer terms from Wikipedia ; and an approach to expanding a seed vocabulary of consumer-friendly terms .
The above resources and methods serve as a foundation for the more complex methods needed to accomplish the higher-level NLP tasks listed next. When analyzing online communications, attribution of a post to its author might be very important if, for example, clinicians would like to intervene online. Lee and colleagues discussed prevention of back pain through the detection of risk factors, as individuals tweeting about certain activities and health problems are likely to tweet about acute back pain shortly afterwards . In this study, the attribution of back pain to the authors was determined by the use of personal pronouns. A more sophisticated approach to attribution of disorders to patients in health forums was proposed by Driscoll and colleagues, who cast disorder attribution as a classification task and used Brown cluster assignments and syntactic features . Other potential interventions could be based on a possible relationship between posting to an online weight loss forum and weight changes. Hekler and colleagues suggested that increased use of past-tense and motion words, such as go, car, and arrive, was associated with lower weekly weights of online forum users . On the other hand, increased use of conjunctions and exclusion words (e.g., but, without, exclude) was associated with higher weights.
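Pronoun-based attribution of the kind used in the back-pain study can be sketched as a simple proximity check: a health mention is attributed to the post's author only when a first-person pronoun appears near it. The pronoun list, window size, and whitespace tokenization are simplifying assumptions for the example.

```python
# First-person pronouns used as attribution evidence (illustrative list).
FIRST_PERSON = {"i", "me", "my", "mine"}

def attributed_to_author(post: str, mention: str, window: int = 5) -> bool:
    """True if a first-person pronoun occurs within `window` tokens of the
    health mention, suggesting the post is about the author."""
    tokens = post.lower().replace(",", " ").split()
    if mention not in tokens:
        return False
    idx = tokens.index(mention)
    context = tokens[max(0, idx - window):idx + window + 1]
    return bool(FIRST_PERSON & set(context))
```

This heuristic is exactly what the classification approach of Driscoll and colleagues improves on: posts about a relative ("my mother's pain") or with displaced pronouns defeat a pure proximity rule, which is why richer features such as Brown clusters and syntax help.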
Understanding the interactions between patient discourse and social support has been investigated in the past several years in the context of online health discussion boards and online support groups. Using a range of linguistic features, Wang and colleagues developed a supervised machine learning approach for predicting whether a post falls into a predefined message type (e.g., positive emotional disclosure, question asking, etc.). The developed method was consistent with human judgments in establishing that when people convey their negative experiences, thoughts, and feelings, others provide them with emotional support. In another study, Vlahovic and colleagues confirmed that there is a strong link between a discussion participant’s satisfaction and the type of support they receive and provide.
While Wang and colleagues found that requesting support and talking about exclusively negative events triggered support from others, Lewallen and colleagues did not find that greater use of negative emotions predicted peer responsiveness; however, three other factors did: greater message length, lower use of second-person pronouns, and lower use of positive emotion words.
NLP techniques have also been used to study the associations between participation and various outcomes. For instance, Zhang and colleagues investigated the impact of different factors on post sentiment, assessed automatically using a learned, forum-specific sentiment analysis tool. They found a significant increase in the sentiment of posts as patients continue posting over time, with different patterns of sentiment trends for initial posts in threads and for reply posts. Zhao and colleagues also leveraged sentiment analysis, but for the task of identifying influential users in the community. Their working hypothesis is that influence can be approximated through a user’s ability to affect the sentiment of others. They proposed a novel metric that incorporates this hypothesis.
Social media is also an excellent venue for measuring patients’ perception of healthcare quality. Not surprisingly, active research in this area is ongoing. Wallace and colleagues proposed a factorial latent Dirichlet allocation (f-LDA) model to uncover patients’ sentiments about important aspects of healthcare, such as interpersonal manner, technical competence, and systems issues, expressed in RateMDs reviews. They showed that f-LDA predictions of positive and negative sentiment correlate well with state-level measures of healthcare quality.
Researchers examined Twitter to measure patient-perceived quality of care in UK and US hospitals, respectively. Greaves and colleagues used commercial software that relies on POS tagging, syntactic parsing, compositional sentiment lexica, and a sentiment grammar to classify tweets about hospitals as positive or negative. The average sentiment about a hospital was computed as the proportion of positive tweets to the total number of tweets. The correlation between the overall patient experience score from the NHS inpatient survey and the automated Twitter sentiment score was low, which might be explained by the relatively low agreement between manually rated sentiment and automated sentiment analysis. Hawkins and colleagues used publicly available software to derive sentiment scores and then calculated a mean sentiment score for each of 297 US hospitals with at least 50 patient experience tweets. The Twitter sentiment scores did not correlate with a formal US nationwide patient experience survey and correlated only weakly with the Hospital Compare 30-day hospital readmission rate. Despite the weak or absent correlation with official hospital satisfaction metrics in both studies, the authors recommended continuing to monitor these feeds to better understand the experiences of healthcare consumers.
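The proportion-based hospital score described above amounts to a simple aggregation once tweets have been labeled by an upstream sentiment classifier. This sketch is ours; the data layout and function name are assumptions, and only the 50-tweet inclusion floor comes from the Hawkins study:

```python
from collections import defaultdict

def hospital_sentiment_scores(labeled_tweets, min_tweets=50):
    """Score each hospital as the fraction of its tweets labeled positive.
    `labeled_tweets` is a list of (hospital_id, label) pairs with label in
    {'pos', 'neg'}; hospitals below the tweet floor are excluded, mirroring
    the at-least-50-tweets threshold reported by Hawkins and colleagues."""
    counts = defaultdict(lambda: [0, 0])  # hospital -> [positive, total]
    for hospital, label in labeled_tweets:
        counts[hospital][1] += 1
        if label == "pos":
            counts[hospital][0] += 1
    return {h: pos / total for h, (pos, total) in counts.items()
            if total >= min_tweets}
```

The hard part, of course, is the upstream labeling, which is exactly where both studies saw low agreement with human raters.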
Drawing on data from two South Korean online communities predominantly used by parents to discuss pediatric services, Jung and colleagues defined six quality factors for social media–based hospital service quality analysis and used keywords corresponding to the factors for recommending hospitals.
Although consumers’ health information needs are well studied (primarily through search engine log analysis), consumer-health question answering (QA) is a relatively new area, with most of the work focusing on question analysis. Roberts and Demner-Fushman analyzed several consumer and professional question answering venues and found that the form of consumer questions is highly dependent upon the individual online resource, especially in the amount of background information provided. Professionals, on the other hand, provide very little background information and often ask much shorter questions. The content of consumer questions is also highly dependent upon the resource. While professional questions commonly discuss treatments and tests, consumer questions focus disproportionately on symptoms and diseases. Further, consumers place far more emphasis on certain types of health problems (e.g., sexual health). Cohen and colleagues have shown that interactive question answering sites could efficiently address consumer health question answering through either short answers by a small number of dedicated physicians, enabling high throughput, or physician experts operating as moderators in patient forums. Luo and colleagues used syntactic and semantic analysis to align a new question with questions previously submitted to NetWellness, a website through which highly qualified volunteers provide answers to consumers’ health questions.
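The core of question alignment is a similarity function over question pairs. Luo and colleagues used syntactic and semantic analysis for this; the sketch below substitutes a much simpler bag-of-words cosine as a stand-in, purely to illustrate the retrieval step (function names and the toy archive are our assumptions):

```python
import math
from collections import Counter

def cosine(a, b):
    """Bag-of-words cosine similarity between two questions; a crude
    stand-in for the syntactic/semantic alignment in the cited work."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def best_match(new_question, archived_questions):
    """Return the previously answered question most similar to the new one."""
    return max(archived_questions, key=lambda q: cosine(new_question, q))

archive = ["how to treat headaches", "what causes lower back pain"]
print(best_match("what causes back pain", archive))
```

A matched archived question lets the site reuse an existing expert answer instead of queueing the new question for a volunteer.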
To counterbalance the somewhat pessimistic outlook expressed in the Circulation: Cardiovascular Quality and Outcomes editorial, which rightfully indicated there are very few NLP systems in daily healthcare use, we note the success stories and recent improvements in approaches to established tasks, such as NLP support for coding for billing purposes and quality improvement, patient record summarization, as well as a growing contribution to retrospective studies and phenotyping algorithms. We also note the successful integration of an NLP-based algorithm for finding congestive heart failure in a live health information exchange.
To produce more success stories for clinical NLP in practice, the NLP research community needs EHR vendors to buy into the technology and collaborate with NLP researchers. Another important aspect of success is educating clinicians about the systems that target their activities; many clinicians indicated that a lack of information about clinical decision support systems (CDSSs) in their orientation contributed to their delayed use of these systems. Another missing element in measuring success is appropriate evaluation metrics. Surveys are widely used to study clinicians’ satisfaction with systems, but practical measures of the impact on healthcare outcomes still need to be developed.
Although we see increasing use of publicly available tools, pipelines that use these tools for identical purposes at different institutions, e.g., ejection fraction detection or adenoma detection, are still sometimes programmed independently at each institution. Some progress has been made in porting clinical models and NLP methods [104, 105]; however, more work is needed on porting pipelines with easy domain adaptation.
Seeing that community-wide challenges and the datasets they make publicly available, such as i2b2, ShARe, THYME, and TREC, do facilitate fundamental research, we need more and larger publicly available clinical text collections. We also appreciate individual researchers’ efforts to make datasets and code available, and hope to see even more sharing in the future.
Overall, we see three directions in clinical NLP development: patterns to share for simple tasks; more sophisticated methods yet to be developed for more complex tasks; and tasks that have yet to be addressed and are therefore of unknown complexity. We also see that the latest NLP methods are not used in applications: they are explored, published, and shelved. We hope that worthy new methods will receive more attention and be seen through to practice. More NLP research is needed to support meeting quality measures, health information exchange, and interoperability.
In the realm of identifying and processing health-related texts in social media, we see that some researchers stay within the analysis realm, but it is interesting to see a growing number of publications aspiring to interventions based on real-time processing of online consumer-generated texts. Most text processing in this area uses techniques that are very simple from an NLP standpoint, yet effective.
This work was in part supported by the Intramural Research Program of the NIH, National Library of Medicine (DDF) and award R01 GM114355 from the National Institute of General Medical Sciences (NE).
We thank the anonymous reviewers and the section editors for the encouragement, thorough reviews, and helpful suggestions.