
1.  Extracting Diagnoses and Investigation Results from Unstructured Text in Electronic Health Records by Semi-Supervised Machine Learning 
PLoS ONE  2012;7(1):e30412.
Electronic health records are invaluable for medical research, but much of the information is recorded as unstructured free text which is time-consuming to review manually.
To develop an algorithm to identify relevant free texts automatically based on labelled examples.
We developed a novel machine learning algorithm, the ‘Semi-supervised Set Covering Machine’ (S3CM), and tested its ability to detect the presence of coronary angiogram results and ovarian cancer diagnoses in free text in the General Practice Research Database. For training the algorithm, we used texts classified as positive and negative according to their associated Read diagnostic codes, rather than by manual annotation. We evaluated the precision (positive predictive value) and recall (sensitivity) of S3CM in classifying unlabelled texts against the gold standard of manual review. We compared the performance of S3CM with the Transductive Support Vector Machine (TSVM), the original fully-supervised Set Covering Machine (SCM) and our ‘Freetext Matching Algorithm’ natural language processor.
Only 60% of texts with Read codes for angiogram actually contained angiogram results. However, the S3CM algorithm achieved 87% recall with 64% precision on detecting coronary angiogram results, outperforming the fully-supervised SCM (recall 78%, precision 60%) and TSVM (recall 2%, precision 3%). For ovarian cancer diagnoses, S3CM had higher recall than the other algorithms tested (86%). The Freetext Matching Algorithm had better precision than S3CM (85% versus 74%) but lower recall (62%).
Our novel S3CM machine learning algorithm effectively detected free texts in primary care records associated with angiogram results and ovarian cancer diagnoses, after training on pre-classified test sets. It should be easy to adapt to other disease areas as it does not rely on linguistic rules, but needs further testing in other electronic health record datasets.
PMCID: PMC3261909  PMID: 22276193
2.  Identification of pneumonia and influenza deaths using the death certificate pipeline 
Death records are a rich source of data, which can be used to assist with public health surveillance and/or decision support. However, to use this type of data for such purposes it has to be transformed into a coded format to make it computable. Because the cause of death in the certificates is reported as free text, encoding the data is currently the single largest barrier to using death certificates for surveillance. The purpose of this study was therefore to demonstrate the feasibility of using a pipeline, composed of a detection rule and a natural language processor, for real-time encoding of death certificates, using the identification of pneumonia and influenza cases as an example, and to show that its accuracy is comparable to existing methods.
A Death Certificates Pipeline (DCP) was developed to automatically code death certificates and identify pneumonia and influenza cases. The pipeline used MetaMap to code death certificates from the Utah Department of Health for the year 2008. The output of MetaMap was then processed by detection rules, which flagged pneumonia and influenza cases based on the Centers for Disease Control and Prevention (CDC) case definition. The output from the DCP was compared with the current method used by the CDC and with a keyword search. Recall, precision, positive predictive value and F-measure with respect to the CDC method were calculated for the two other methods considered here. The two techniques compared here with the CDC method showed the following recall/precision results: DCP: 0.998/0.98 and keyword searching: 0.96/0.96. The F-measures were 0.99 and 0.96, respectively. Both the keyword search and the DCP can run interactively with modest computer resources, but the DCP showed superior performance.
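The recall, precision and F-measure reported above follow the standard definitions; as a minimal sketch (the counts below are illustrative, not the study's data):

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Compute precision, recall and F-measure from raw counts."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Example with made-up counts:
p, r, f = precision_recall_f1(true_pos=98, false_pos=2, false_neg=2)
# precision = 0.98, recall = 0.98, F1 = 0.98
```

Precision here is the same quantity as positive predictive value, and recall the same as sensitivity, which is why abstracts in this listing use the terms interchangeably.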
The pipeline proposed here for coding death certificates and the detection of cases is feasible and can be extended to other conditions. This method provides an alternative that allows for coding free-text death certificates in real time that may increase its utilization not only in the public health domain but also for biomedical researchers and developers.
Trial Registration
This study did not involve any clinical trials.
PMCID: PMC3444937  PMID: 22569097
Public health informatics; Natural language processing; Surveillance; Pneumonia and influenza
3.  De-identification of primary care electronic medical records free-text data in Ontario, Canada 
Electronic medical records (EMRs) represent a potentially rich source of health information for research, but the free text in EMRs often contains identifying information. While de-identification tools have been developed for free text, none have been developed or tested for the full range of primary care EMR data.
We used deid open source de-identification software and modified it for an Ontario context for use on primary care EMR data. We developed the modified program on a training set of 1000 free-text records from one group practice and then tested it on two validation sets from a random sample of 700 free-text EMR records from 17 different physicians from 7 different practices in 5 different cities and 500 free-text records from a group practice that was in a different city than the group practice that was used for the training set. We measured the sensitivity/recall, precision, specificity, accuracy and F-measure of the modified tool against manually tagged free-text records to remove patient and physician names, locations, addresses, medical record, health card and telephone numbers.
We found that the modified training program performed with a sensitivity of 88.3%, specificity of 91.4%, precision of 91.3%, accuracy of 89.9% and F-measure of 0.90. The validation sets had sensitivities of 86.7% and 80.2%, specificities of 91.4% and 87.7%, precisions of 91.1% and 87.4%, accuracies of 89.0% and 83.8% and F-measures of 0.89 and 0.84 for the first and second validation sets, respectively.
The deid program can be modified to reasonably accurately de-identify free-text primary care EMR records while preserving clinical content.
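As a toy illustration of the kind of pattern-based redaction a tool like deid performs (the two regular expressions below are illustrative assumptions, far simpler than deid's actual dictionaries and context rules):

```python
import re

# Illustrative patterns only -- a real de-identification tool uses far
# richer rules, name dictionaries and context checks than these.
PATTERNS = {
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "HEALTH_CARD": re.compile(r"\b\d{10}\b"),  # assumes a 10-digit card number
}

def redact(text):
    """Replace each matched identifier with a category placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact("Call 416-555-0123 re: card 1234567890."))
# -> Call [PHONE] re: card [HEALTH_CARD].
```

Sensitivity (recall) matters most in this setting: a missed identifier is a privacy breach, whereas a false positive merely removes a little clinical content.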
PMCID: PMC2907300  PMID: 20565894
4.  Extracting Surveillance Data from Templated Sections of an Electronic Medical Note: Challenges and Opportunities 
To highlight the importance of templates in extracting surveillance data from the free text of electronic medical records using natural language processing (NLP) techniques.
The mainstay of recording patient data is the free text of the electronic medical record (EMR). While the chief complaint and history of presenting illness are recorded in the patient's ‘own words’, the rest of the electronic note is written in the provider's words. Providers often use boilerplate templates from EMR pull-downs to document information on the patient in the form of checklists, check boxes, yes/no and free-text responses to questions. When these templates are used for recording symptoms, demographic information or medical, social or travel history, they represent an important source of surveillance data [1]. There is a dearth of literature on the use of natural language processing in extracting data from templates in the EMR.
A corpus of 1000 free text medical notes from the VA integrated electronic medical record (CPRS) was reviewed to identify commonly used templates. Of these, 500 were enriched for the surveillance domain of interest for this project (homelessness). The other 500 were randomly sampled from a large corpus of electronic notes. An NLP algorithm was developed to extract concepts related to our target surveillance domain. A manual review of the notes was performed by three human reviewers to generate a document-level reference standard that classified this set of documents as either demonstrating evidence of homelessness (H) or not (NH). A rule-based NLP algorithm was developed that used a combination of key word searches and negation based on an extensive lexicon of terms developed for this purpose. A random sample of 50 documents each of H and NH documents were reviewed after each iteration of the NLP algorithm to determine the false positive rate of the extracted concepts.
The corpus consisted of 48% H and 52% NH documents as determined by human review. The NLP algorithm successfully extracted concepts from these documents. The H set had an average of 8 concepts related to homelessness per document (median 8, range 1 to 34). The NH set had an average of 2 concepts (median 1, range 1 to 13). Thirteen template patterns were identified in this set of documents. The three most common were check boxes with square brackets, Yes/No responses and free-text answers after a question. Several positively and negatively asserted concepts were noted in the responses to templated questions such as “Are you currently homeless: Yes or No”; “How many times have you been homeless in the past 3 years: (free text response)”; “Have you ever been in jail? [Y] or [N]”; “Are you in need of substance abuse services? Yes or No”. Human review of a random sample of documents at the concept level indicated that the NLP algorithm generated 28% false positives in extracting concepts related to homelessness when templates were ignored among the H documents. When the algorithm was refined to include templates, the false positive rate declined to 22%. For the NH documents, the corresponding false positive rates were 56% and 21%.
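A toy version of the rule-based keyword-plus-negation matching described above (the term list and negation cues are illustrative assumptions, not the study's lexicon, which was far more extensive):

```python
# Illustrative term list and negation cues; a real lexicon is much larger,
# and NegEx-style negation is scoped rather than sentence-wide.
HOMELESS_TERMS = ["homeless", "shelter", "living on the street"]
NEGATION_CUES = ["no ", "not ", "denies ", "[n]"]

def extract_concepts(sentence):
    """Return asserted (non-negated) target concepts found in one sentence."""
    s = sentence.lower()
    negated = any(cue in s for cue in NEGATION_CUES)
    return [t for t in HOMELESS_TERMS if t in s and not negated]

assert extract_concepts("Patient is currently homeless.") == ["homeless"]
assert extract_concepts("Denies being homeless.") == []
# A checked template box can be treated as a negation cue:
assert extract_concepts("Have you ever been homeless? [N]") == []
```

The last case shows why templates matter: without handling the `[N]` checkbox, a naive keyword match would count "homeless" as a positive mention, which is exactly the false-positive pattern the abstract reports.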
To our knowledge, this is one of the first attempts to address the problem of information extraction from templates or templated sections of the EMR. A key challenge of templates is that they will most likely lead to poor performance of NLP algorithms and cause bottlenecks in processing if they are not considered. Acknowledging the presence of templates and refining NLP algorithms to handle them improves information extraction from free text medical notes, thus creating an opportunity for improved surveillance using the EMR. Algorithms will likely need to be customized to the electronic medical record and the surveillance domain of interest. A more detailed analysis of the templated sections is underway.
PMCID: PMC3692923
natural language processing; surveillance; templates; VA
5.  Comparing Natural Language Processing Tools to Extract Medical Problems from Narrative Text 
To help maintain a complete, accurate and timely Problem List, we are developing a system to automatically retrieve medical problems from free-text documents. This system uses Natural Language Processing to analyze all electronic narrative text documents in a patient’s record. Here we evaluate and compare 3 different applications of NLP technology in our system: the first using MMTx (MetaMap Transfer) with a negation detection algorithm (NegEx), the second using an alpha version of a locally developed NLP application called MPLUS2, and the third using keyword searching. They were adapted and trained to extract medical problems from a set of 80 problems of diagnosis type. The version using MMTx and NegEx was improved by adding some disambiguation and modifying the negation detection algorithm, and these modifications significantly improved recall and precision. The different versions of the NLP module were compared, and showed the following recall / precision results: standard MMTx with NegEx version 0.775 / 0.398; improved MMTx with NegEx version 0.892 / 0.753; MPLUS2 version 0.693 / 0.402; and keyword searching version 0.575 / 0.807. Average results for the reviewers were a recall of 0.788 and a precision of 0.912.
PMCID: PMC1560561  PMID: 16779095
6.  Using free text information to explore how and when GPs code a diagnosis of ovarian cancer: an observational study using primary care records of patients with ovarian cancer 
BMJ Open  2011;1(1):e000025.
Primary care databases provide a unique resource for healthcare research, but most researchers currently use only the Read codes for their studies, ignoring information in the free text, which is much harder to access.
To investigate how much information on ovarian cancer diagnosis is ‘hidden’ in the free text and the time lag between a diagnosis being described in the text or in a hospital letter and the patient being given a Read code for that diagnosis.
Anonymised free text records from the General Practice Research Database of 344 women with a Read code indicating ovarian cancer between 1 June 2002 and 31 May 2007 were used to compare the date at which the diagnosis was first coded with the date at which the diagnosis was recorded in the free text. Free text relating to a diagnosis was identified (a) from the date of coded diagnosis and (b) by searching for words relating to the ovary.
90% of cases had information relating to their ovary in the free text. 45% had text indicating a definite diagnosis of ovarian cancer. 22% had text confirming a diagnosis before the coded date; 10% over 4 weeks previously. Four patients did not have ovarian cancer and 10% had only ambiguous or suspected diagnoses associated with the ovarian cancer code.
There was a vast amount of extra information relating to diagnoses in the free text. Although in most cases text confirmed the coded diagnosis, it also showed that in some cases GPs do not code a definite diagnosis on the date that it is confirmed. For diseases which rely on hospital consultants for diagnosis, free text (particularly letters) is invaluable for accurate dating of diagnosis and referrals and also for identifying misclassified cases.
Article summary
Article focus
How much information on ovarian cancer diagnoses is ‘hidden’ in the free text of primary care records?
How accurate is the date of diagnosis based only on Read codes?
How many cases might be misclassified if codes alone are used to identify diagnoses?
Key messages
Free text contains much extra information on ovarian cancer diagnoses, including the dates on which the patient was investigated and diagnosed in secondary care.
This information can be used to determine the date at which a diagnosis was notified to the GP and to identify cases that have not been coded.
For certain disease areas, particularly where specialist care is involved, free text should be used to determine the extent of misclassification associated with both the (coded) date of diagnosis and identification of cases.
Strengths and limitations of this study
An in-depth analysis of information relating to ovarian cancer diagnoses using free text records from a large primary care database.
We did not have access to letters that had been scanned in as images, so will have missed some important information.
We only looked at cases which had been assigned an unambiguous Read code for ovarian cancer and thus will have missed cases with no code or an ambiguous code.
We ignored text that did not explicitly refer to the patient's ovaries and thus did not investigate pathways of care or symptoms. This is the topic of a separate study which is already underway.
We only looked at ovarian cancer, and cannot say whether our findings can be generalised to other diseases.
PMCID: PMC3191398  PMID: 22021731
Electronic patient records; survey data; non-response bias in surveys; multivariate statistics; misclassification bias
7.  The Effectiveness of Mobile-Health Technology-Based Health Behaviour Change or Disease Management Interventions for Health Care Consumers: A Systematic Review 
PLoS Medicine  2013;10(1):e1001362.
Caroline Free and colleagues systematically review a fast-moving field, that of the effectiveness of mobile technology interventions delivered to healthcare consumers, and conclude that high-quality, adequately powered trials of optimized interventions are required to evaluate effects on objective outcomes.
Mobile technologies could be a powerful media for providing individual level support to health care consumers. We conducted a systematic review to assess the effectiveness of mobile technology interventions delivered to health care consumers.
Methods and Findings
We searched for all controlled trials of mobile technology-based health interventions delivered to health care consumers using MEDLINE, EMBASE, PsycINFO, Global Health, Web of Science, Cochrane Library, UK NHS HTA (Jan 1990–Sept 2010). Two authors extracted data on allocation concealment, allocation sequence, blinding, completeness of follow-up, and measures of effect. We calculated effect estimates and used random effects meta-analysis. We identified 75 trials. Fifty-nine trials investigated the use of mobile technologies to improve disease management and 26 trials investigated their use to change health behaviours. Nearly all trials were conducted in high-income countries. Four trials had a low risk of bias. Two trials of disease management had low risk of bias; in one, a trial of antiretroviral therapy (ART) adherence, use of text messages reduced high viral load (>400 copies), with a relative risk (RR) of 0.85 (95% CI 0.72–0.99), but no statistically significant benefit on mortality (RR 0.79 [95% CI 0.47–1.32]). In a second, a PDA-based intervention increased scores for perceived self-care agency in lung transplant patients. Two trials of health behaviour management had low risk of bias. The pooled effect of text messaging smoking cessation support on biochemically verified smoking cessation was RR 2.16 (95% CI 1.77–2.62). Interventions for other conditions showed suggestive benefits in some cases, but the results were not consistent. No evidence of publication bias was demonstrated on visual or statistical examination of the funnel plots for either disease management or health behaviours. To address the limitation of the older search, we also reviewed more recent literature.
Text messaging interventions increased adherence to ART and smoking cessation and should be considered for inclusion in services. Although there is suggestive evidence of benefit in some other areas, high quality adequately powered trials of optimised interventions are required to evaluate effects on objective outcomes.
Please see later in the article for the Editors' Summary
Editors’ Summary
Every year, millions of people die from cardiovascular diseases (diseases of the heart and circulation), chronic obstructive pulmonary disease (a long-term lung disease), lung cancer, HIV infection, and diabetes. These diseases are increasingly important causes of mortality (death) in low- and middle-income countries and are responsible for nearly 40% of deaths in high-income countries. For all these diseases, individuals can adopt healthy behaviors that help prevent disease onset. For example, people can lower their risk of diabetes and cardiovascular disease by maintaining a healthy body weight, and, if they are smokers, they can reduce their risk of lung cancer and cardiovascular disease by giving up cigarettes. In addition, optimal treatment of existing diseases can reduce mortality and morbidity (illness). Thus, in people who are infected with HIV, antiretroviral therapy delays the progression of HIV infection and the onset of AIDS, and in people who have diabetes, good blood sugar control can prevent retinopathy (a type of blindness) and other serious complications of diabetes.
Why Was This Study Done?
Health-care providers need effective ways to encourage "health-care consumers" to make healthy lifestyle choices and to self-manage chronic diseases. The amount of information, encouragement and support that can be conveyed to individuals during face-to-face consultations or through traditional media such as leaflets is limited, but mobile technologies such as mobile phones and portable computers have the potential to transform the delivery of health messages. These increasingly popular technologies—more than two-thirds of the world's population now owns a mobile phone—can be used to deliver health messages to people anywhere and at the most relevant times. For example, smokers trying to quit smoking can be sent regular text messages to sustain their motivation, but can also use text messaging to request extra support when it is needed. But is "mHealth," the provision of health-related services using mobile communication technology, an effective way to deliver health messages to health-care consumers? In this systematic review (a study that uses predefined criteria to identify all the research on a given topic), the researchers assess the effectiveness of mobile technology-based health behavior change interventions and disease management interventions delivered to health-care consumers.
What Did the Researchers Do and Find?
The researchers identified 75 controlled trials (studies that compare the outcomes of people who do and do not receive an intervention) of mobile technology-based health interventions delivered to health-care consumers that met their predefined criteria. Twenty-six trials investigated the use of mobile technologies to change health behaviors, 59 investigated their use in disease management, most were of low quality, and nearly all were undertaken in high-income countries. In one high-quality trial that used text messages to improve adherence to antiretroviral therapy among HIV-positive patients in Kenya, the intervention significantly reduced the patients’ viral load but did not significantly reduce mortality (the observed reduction in deaths may have happened by chance). In two high-quality UK trials, a smoking intervention based on text messaging (txt2stop) more than doubled biochemically verified smoking cessation. Other lower-quality trials indicated that using text messages to encourage physical activity improved diabetes control but had no effect on body weight. Combined diet and physical activity text messaging interventions also had no effect on weight, whereas interventions for other conditions showed suggestive benefits in some but not all cases.
What Do These Findings Mean?
These findings provide mixed evidence for the effectiveness of health intervention delivery to health-care consumers using mobile technologies. Moreover, they highlight the need for additional high-quality controlled trials of this mHealth application, particularly in low- and middle-income countries. Specifically, the demonstration that text messaging interventions increased adherence to antiretroviral therapy in a low-income setting and increased smoking cessation in a high-income setting provides some support for the inclusion of these two interventions in health-care services in similar settings. However, the effects of these two interventions need to be established in other settings and their cost-effectiveness needs to be measured before they are widely implemented. Finally, for other mobile technology–based interventions designed to change health behaviors or to improve self-management of chronic diseases, the results of this systematic review suggest that the interventions need to be optimized before further trials are undertaken to establish their clinical benefits.
Additional Information
Please access these Web sites via the online version of this summary at
A related PLOS Medicine Research Article by Free et al. investigates the ability of mHealth technologies to improve health-care service delivery processes
Wikipedia has a page on mHealth (note: Wikipedia is a free online encyclopedia that anyone can edit; available in several languages)
mHealth: New horizons for health through mobile technologies is a global survey of mHealth prepared by the World Health Organization’s Global Observatory for eHealth (eHealth is health-care practice supported by electronic processes and communication)
The mHealth in Low-Resource Settings website, which is maintained by the Netherlands Royal Tropical Institute, provides information on the current use, potential, and limitations of mHealth in low-resource settings
More information about Txt2stop is available, the UK National Health Service Choices website provides an analysis of the Txt2stop trial and what its results mean, and the UK National Health Service Smokefree website provides a link to a Quit App for the iPhone
The US Centers for Disease Control and Prevention has launched a text messaging service that delivers regular health tips and alerts to mobile phones
PMCID: PMC3548655  PMID: 23349621
8.  Redundancy in electronic health record corpora: analysis, impact on text mining performance and mitigation strategies 
BMC Bioinformatics  2013;14:10.
The increasing availability of Electronic Health Record (EHR) data and specifically free-text patient notes presents opportunities for phenotype extraction. Text-mining methods in particular can help disease modeling by mapping named-entity mentions to terminologies and clustering semantically related terms. EHR corpora, however, exhibit specific statistical and linguistic characteristics when compared with corpora in the biomedical literature domain. We focus on copy-and-paste redundancy: clinicians typically copy and paste information from previous notes when documenting a current patient encounter. Thus, within a longitudinal patient record, one expects to observe heavy redundancy. In this paper, we ask three research questions: (i) How can redundancy be quantified in large-scale text corpora? (ii) Conventional wisdom is that larger corpora yield better results in text mining. But how does the observed EHR redundancy affect text mining? Does such redundancy introduce a bias that distorts learned models? Or does the redundancy introduce benefits by highlighting stable and important subsets of the corpus? (iii) How can one mitigate the impact of redundancy on text mining?
We analyze a large-scale EHR corpus and quantify redundancy both in terms of word and semantic concept repetition. We observe redundancy levels of about 30% and non-standard distribution of both words and concepts. We measure the impact of redundancy on two standard text-mining applications: collocation identification and topic modeling. We compare the results of these methods on synthetic data with controlled levels of redundancy and observe significant performance variation. Finally, we compare two mitigation strategies to avoid redundancy-induced bias: (i) a baseline strategy, keeping only the last note for each patient in the corpus; (ii) removing redundant notes with an efficient fingerprinting-based algorithm. For text mining, preprocessing the EHR corpus with fingerprinting yields significantly better results.
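A minimal sketch of shingle-based fingerprinting for detecting copy-and-paste redundancy (the shingle size, hashing choice and example notes are illustrative assumptions; the study's algorithm is more efficient and elaborate):

```python
import hashlib

def fingerprints(text, n=8):
    """Hash every n-word shingle of a note into a set of fingerprints."""
    words = text.lower().split()
    return {hashlib.md5(" ".join(words[i:i + n]).encode()).hexdigest()
            for i in range(max(1, len(words) - n + 1))}

def redundancy(new_note, prior_note, n=8):
    """Fraction of the new note's shingles already present in the prior note."""
    new, old = fingerprints(new_note, n), fingerprints(prior_note, n)
    return len(new & old) / len(new)

note1 = "patient seen for follow up of copd stable on current inhalers no acute distress"
note2 = note1 + " new complaint of mild ankle swelling noted today"
note3 = "unrelated visit for annual physical exam all labs within normal limits reviewed"

# A copy-and-pasted note shares most of its shingles with the prior note;
# an unrelated note shares none.
assert redundancy(note2, note1) > 0.4
assert redundancy(note3, note1) == 0.0
```

Notes whose redundancy score exceeds a chosen threshold can then be dropped before text mining, which is the mitigation strategy the abstract reports as outperforming the keep-last-note baseline.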
Before applying text-mining techniques, one must pay careful attention to the structure of the analyzed corpora. While the importance of data cleaning has been known for low-level text characteristics (e.g., encoding and spelling), high-level and difficult-to-quantify corpus characteristics, such as naturally occurring redundancy, can also hurt text mining. Fingerprinting enables text-mining techniques to leverage available data in the EHR corpus, while avoiding the bias introduced by redundancy.
PMCID: PMC3599108  PMID: 23323800
9.  Combining Free Text and Structured Electronic Medical Record Entries to Detect Acute Respiratory Infections 
PLoS ONE  2010;5(10):e13377.
The electronic medical record (EMR) contains a rich source of information that could be harnessed for epidemic surveillance. We asked if structured EMR data could be coupled with computerized processing of free-text clinical entries to enhance detection of acute respiratory infections (ARI).
A manual review of EMR records related to 15,377 outpatient visits uncovered 280 reference cases of ARI. We used logistic regression with backward elimination to determine which among candidate structured EMR parameters (diagnostic codes, vital signs and orders for tests, imaging and medications) contributed to the detection of those reference cases. We also developed a computerized free-text search to identify clinical notes documenting at least two non-negated ARI symptoms. We then used heuristics to build case-detection algorithms that best combined the retained structured EMR parameters with the results of the text analysis.
Principal Findings
An adjusted grouping of diagnostic codes identified reference ARI patients with a sensitivity of 79%, a specificity of 96% and a positive predictive value (PPV) of 32%. Of the 21 additional structured clinical parameters considered, two contributed significantly to ARI detection: new prescriptions for cough remedies and elevations in body temperature to at least 38°C. Together with the diagnostic codes, these parameters increased detection sensitivity to 87%, but specificity and PPV declined to 95% and 25%, respectively. Adding text analysis increased sensitivity to 99%, but PPV dropped further to 14%. Algorithms that required satisfying both a query of structured EMR parameters as well as text analysis disclosed PPVs of 52–68% and retained sensitivities of 69–73%.
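A toy sketch of the combined detection rule described above, requiring both a structured criterion and at least two non-negated text symptoms (the codes, symptom list, thresholds and negation handling are simplified assumptions, not the study's actual algorithm):

```python
ARI_CODES = {"460", "465.9", "466.0"}                    # illustrative codes
ARI_SYMPTOMS = {"cough", "fever", "sore throat", "congestion"}

def text_symptom_count(note):
    """Count distinct, crudely non-negated ARI symptoms mentioned in a note."""
    note = note.lower()
    return sum(1 for s in ARI_SYMPTOMS
               if s in note and f"no {s}" not in note and f"denies {s}" not in note)

def detect_ari(visit):
    """Flag a visit only if BOTH a structured criterion and the text rule fire."""
    structured = (bool(visit["codes"] & ARI_CODES)
                  or visit.get("temp_c", 0) >= 38
                  or visit.get("cough_rx", False))
    return structured and text_symptom_count(visit["note"]) >= 2

visit = {"codes": {"465.9"}, "temp_c": 38.4,
         "note": "Presents with cough and fever; denies sore throat."}
assert detect_ari(visit)  # code and temperature fire, two asserted symptoms
```

Requiring both criteria is what trades sensitivity for precision in the abstract's figures: either criterion alone catches more cases (higher sensitivity) but admits far more false positives (lower PPV).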
Structured EMR parameters and free-text analyses can be combined into algorithms that can detect ARI cases with new levels of sensitivity or precision. These results highlight potential paths by which repurposed EMR information could facilitate the discovery of epidemics before they cause mass casualties.
PMCID: PMC2954790  PMID: 20976281
10.  Layout-aware text extraction from full-text PDF of scientific articles 
The Portable Document Format (PDF) is the most commonly used file format for online scientific publications. The absence of effective means to extract text from these PDF files in a layout-aware manner presents a significant challenge for developers of biomedical text mining or biocuration informatics systems that use published literature as an information source. In this paper we introduce the ‘Layout-Aware PDF Text Extraction’ (LA-PDFText) system to facilitate accurate extraction of text from PDF files of research articles for use in text mining applications.
Our paper describes the construction and performance of an open source system that extracts text blocks from PDF-formatted full-text research articles and classifies them into logical units based on rules that characterize specific sections. The LA-PDFText system focuses only on the textual content of the research articles and is meant as a baseline for further experiments into more advanced extraction methods that handle multi-modal content, such as images and graphs. The system works in a three-stage process: (1) detecting contiguous text blocks using spatial layout processing to locate and identify blocks of contiguous text, (2) classifying text blocks into rhetorical categories using a rule-based method and (3) stitching classified text blocks together in the correct order, resulting in the extraction of text from section-wise grouped blocks. We show that our system can identify text blocks and classify them into rhetorical categories with precision = 0.96, recall = 0.89 and F1 = 0.91. We also present an evaluation of the accuracy of the block detection algorithm used in step 1. Additionally, we have compared the accuracy of the text extracted by LA-PDFText to the text from the Open Access subset of PubMed Central. We then compared this accuracy with that of the text extracted by the PDF2Text system, commonly used to extract text from PDF. Finally, we discuss preliminary error analysis for our system and identify further areas of improvement.
LA-PDFText is an open-source tool for accurately extracting text from full-text scientific articles. The release of the system is available at
PMCID: PMC3441580  PMID: 22640904
11.  Exploring the role narrative free-text plays in discrepancies between physician coding and the InterVA regarding determination of malaria as cause of death, in a malaria holo-endemic region 
Malaria Journal  2012;11:51.
In countries where tracking mortality and clinical cause of death are not routinely undertaken, gathering verbal autopsies (VA) is the principal method of estimating cause of death. The most common method for determining probable cause of death from the VA interview is Physician-Certified Verbal Autopsy (PCVA). A recent alternative method to interpret verbal autopsies (InterVA) is a computer model that uses a Bayesian approach to derive posterior probabilities for causes of death, given an a priori distribution at population level and a set of interview-based indicators. The model uses the same input information as PCVA, with the exception of the narrative text, which physicians can consult but which is not input into the model. When the results of physician coding are compared with those of the model, large differences could be due to difficulties in diagnosing malaria, especially in holo-endemic regions. Thus, the aim of the study was to explore whether physicians' access to electronically unavailable narrative text helps to explain the large discrepancy in malaria cause-specific mortality fractions (CSMFs) between physician coding and the model.
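The Bayesian core of an InterVA-style model can be sketched in a few lines. All priors, likelihoods, and indicator names below are invented for illustration and are not taken from InterVA-3:

```python
def posterior_causes(prior, likelihoods, observed):
    """Unnormalised Bayes rule as used in InterVA-style models:
    P(cause | indicators) is proportional to P(cause) * product of
    P(indicator | cause) over the observed indicators."""
    scores = {}
    for cause, p_cause in prior.items():
        score = p_cause
        for ind in observed:
            # Small floor probability for indicators unknown to a cause
            score *= likelihoods[cause].get(ind, 1e-6)
        scores[cause] = score
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

# Toy illustration (all numbers hypothetical)
prior = {"malaria": 0.3, "other": 0.7}
likelihoods = {"malaria": {"fever": 0.9}, "other": {"fever": 0.2}}
post = posterior_causes(prior, likelihoods, {"fever"})
```

Even though "other" has the larger prior, observing fever shifts the posterior toward malaria because fever is much more likely under that cause.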
Free-texts of electronically available records (N = 5,649) were summarised and incorporated into the InterVA version 3 (InterVA-3) for three sub-groups: (i) a 10%-representative subsample (N = 493) (ii) records diagnosed as malaria by physicians and not by the model (N = 1035), and (iii) records diagnosed by the model as malaria, but not by physicians (N = 332). CSMF results before and after free-text incorporation were compared.
CSMFs changed by between 5.5% and 10.2% after free-text incorporation. No impact on malaria CSMFs was seen in the representative sub-sample, but the proportion of malaria as cause of death increased in the physician sub-sample (2.7%) and decreased substantially in the InterVA sub-sample (9.9%). Information on 13 of 106 indicators that had not been matched to any item in the structured, electronically available portion of the Nouna questionnaire appeared at least once in the free-texts.
Free-texts are helpful in gathering information not adequately captured in VA questionnaires, though access to free-text does not explain differences in physician and model determination of malaria as cause of death.
PMCID: PMC3359180  PMID: 22353802
Verbal autopsy; Malaria; Free-text; INDEPTH; Cause of death; Burkina Faso; Bayesian InterVA model
12.  Validity of diagnostic coding within the General Practice Research Database: a systematic review 
The British Journal of General Practice  2010;60(572):e128-e136.
The UK-based General Practice Research Database (GPRD) is a valuable source of longitudinal primary care records and is increasingly used for epidemiological research.
To conduct a systematic review of the literature on accuracy and completeness of diagnostic coding in the GPRD.
Design of study
Systematic review.
Six electronic databases were searched using search terms relating to the GPRD, in association with terms synonymous with validity, accuracy, concordance, and recording. A positive predictive value was calculated for each diagnosis that considered a comparison with a gold standard. Studies were also considered that compared the GPRD with other databases and national statistics.
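The positive predictive value used in this review can be computed directly from paired (coded, gold-standard) judgements; a minimal sketch with invented records:

```python
def positive_predictive_value(records):
    """PPV of a diagnostic code: among records carrying the code, the
    fraction confirmed against the gold standard.
    records: iterable of (code_present, gold_confirmed) boolean pairs."""
    confirmed = [gold for code_present, gold in records if code_present]
    return sum(confirmed) / len(confirmed)

# Three records carry the code; two are confirmed, so PPV = 2/3.
ppv = positive_predictive_value(
    [(True, True), (True, True), (True, False), (False, True)]
)
```

Records without the code (the fourth pair above) do not enter the PPV at all; they would matter only for sensitivity.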
A total of 49 papers are included in this review. Forty papers conducted validation of a clinical diagnosis in the GPRD. When assessed against a gold standard (validation using GP questionnaire, primary care medical records, or hospital correspondence), most of the diagnoses were accurately recorded in the patient electronic record. Acute conditions were not as well recorded, with positive predictive values lower than 50%. Twelve papers compared prevalence or consultation rates in the GPRD against other primary care databases or national statistics. Generally, disease prevalence and consultation rates agreed well between the GPRD and other datasets; however, rates of diabetes and musculoskeletal conditions were underestimated in the GPRD.
Most of the diagnoses coded in the GPRD are well recorded. Researchers using the GPRD may want to consider how well the disease of interest is recorded before planning research, and consider how to optimise the identification of clinical events.
PMCID: PMC2828861  PMID: 20202356
database management systems; meta-analysis; sensitivity and specificity; systematic review
13.  Benchmarking Clinical Speech Recognition and Information Extraction: New Data, Methods, and Evaluations 
JMIR Medical Informatics  2015;3(2):e19.
Over a tenth of preventable adverse events in health care are caused by failures in information flow. These failures are tangible in clinical handover: even with a good verbal handover, two-thirds to all of the information is lost after 3-5 shifts if notes are taken by hand, or not taken at all. Speech recognition and information extraction provide a way to fill out a handover form for clinical proofing and sign-off.
The objective of the study was to provide a recorded spoken handover, annotated verbatim transcriptions, and evaluations to support research in spoken and written natural language processing for filling out a clinical handover form. This dataset is based on synthetic patient profiles, thereby avoiding ethical and legal restrictions, while maintaining efficacy for research in speech-to-text conversion and information extraction, based on realistic clinical scenarios. We also introduce a Web app to demonstrate the system design and workflow.
We experiment with Dragon Medical 11.0 for speech recognition and CRF++ for information extraction. To compute features for information extraction, we also apply CoreNLP, MetaMap, and Ontoserver. Our evaluation uses cross-validation techniques to measure processing correctness.
The data provided were a simulation of nursing handover, recorded using a mobile device, built from simulated patient records and handover scripts, and spoken by an Australian registered nurse. Speech recognition correctly recognized 5276 of the 7277 words in our 100 test documents. We considered 50 mutually exclusive categories in information extraction and achieved an F1 (ie, the harmonic mean of precision and recall) of 0.86 in the category for irrelevant text and a macro-averaged F1 of 0.70 over the remaining 35 nonempty categories of the form in our 101 test documents.
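The macro-averaged F1 reported above is the unweighted mean of per-category F1 scores. A self-contained sketch, with purely illustrative labels:

```python
def f1_per_class(true, pred, label):
    """F1 for one class, treating that class as positive."""
    tp = sum(1 for t, p in zip(true, pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(true, pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(true, pred) if t == label and p != label)
    if tp == 0:
        return 0.0
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

def macro_f1(true, pred, labels):
    """Macro-averaging: the unweighted mean of per-class F1 scores,
    so rare categories count as much as frequent ones."""
    return sum(f1_per_class(true, pred, l) for l in labels) / len(labels)

true = ["a", "a", "b", "b"]
pred = ["a", "b", "b", "b"]
m = macro_f1(true, pred, ["a", "b"])
```

Because every category contributes equally, macro-F1 is typically lower than a frequency-weighted average when some categories are sparse, as with the 35 nonempty form categories here.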
The significance of this study hinges on opening our data, together with the related performance benchmarks and some processing software, to the research and development community for studying clinical documentation and language-processing. The data are used in the CLEFeHealth 2015 evaluation laboratory for a shared task on speech recognition.
PMCID: PMC4427705  PMID: 25917752
computer systems evaluation; data collection; information extraction; nursing records; patient handoff; records as topic; speech recognition software
14.  Automated Outcome Classification of Emergency Department Computed Tomography Imaging Reports 
Reliably abstracting outcomes from free-text electronic medical records remains a challenge. While automated classification of free text has been a popular medical informatics topic, performance validation using real-world clinical data has been limited. The two main approaches are linguistic (natural language processing [NLP]) and statistical (machine learning). The authors have developed a hybrid system for abstracting computed tomography (CT) reports for specified outcomes.
The objective was to measure performance of a hybrid NLP and machine learning system for automated outcome classification of emergency department (ED) CT imaging reports. The hypothesis was that such a system is comparable to medical personnel doing the data abstraction.
A secondary analysis was performed on a prior diagnostic imaging study on 3,710 blunt facial trauma victims. Staff radiologists dictated CT reports as free text, which were then deidentified. A trained data abstractor manually coded the reference standard outcome of acute orbital fracture, with a random subset double-coded for reliability. The data set was randomly split evenly into training and testing sets. Training patient reports were used as input to the Medical Language Extraction and Encoding (MedLEE) NLP tool to create structured output containing standardized medical terms and modifiers for certainty and temporal status. Findings were filtered for low certainty and past/future modifiers and then combined with the manual reference standard to generate decision tree classifiers using data mining tools Waikato Environment for Knowledge Analysis (WEKA) 3.7.5 and Salford Predictive Miner 6.6. Performance of decision tree classifiers was evaluated on the testing set with or without NLP processing.
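The certainty/temporal filtering step described above can be sketched as follows. The dictionary keys ('term', 'certainty', 'temporal') are hypothetical stand-ins for the NLP tool's actual structured output:

```python
def filter_findings(findings):
    """Keep terms asserted with high certainty about the present illness,
    dropping low-certainty findings and past/future mentions."""
    return [
        f["term"]
        for f in findings
        if f.get("certainty", "high") != "low"
        and f.get("temporal", "present") == "present"
    ]

# Invented example findings
findings = [
    {"term": "orbital fracture", "certainty": "high", "temporal": "present"},
    {"term": "fracture", "certainty": "low", "temporal": "present"},
    {"term": "old fracture", "certainty": "high", "temporal": "past"},
]
kept = filter_findings(findings)
```

The surviving terms would then be combined with the manual reference standard as features for the decision tree classifiers.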
The performance of machine learning alone was comparable to prior NLP studies (sensitivity = 0.92, specificity = 0.93, precision = 0.95, recall = 0.93, f-score = 0.94), and the combined use of NLP and machine learning shows further improvement (sensitivity = 0.93, specificity = 0.97, precision = 0.97, recall = 0.96, f-score = 0.97). This performance is similar to, or better than, that of medical personnel in previous studies.
A hybrid NLP and machine learning automated classification system shows promise in coding free-text electronic clinical data.
PMCID: PMC3898888  PMID: 24033628
15.  Computerized Text Analysis to Enhance Automated Pneumonia Detection 
To improve the surveillance for pneumonia using the free-text of electronic medical records (EMR).
Information about disease severity could help with both detection and situational awareness during outbreaks of acute respiratory infections (ARI). In this work, we use data from the EMR to identify patients with pneumonia, a key landmark of ARI severity. We asked if computerized analysis of the free-text of clinical notes or imaging reports could complement structured EMR data to uncover pneumonia cases.
A previously validated ARI case-detection algorithm (CDA) (sensitivity, 99%; PPV, 14%) [1] flagged VAMHCS outpatient visits with associated chest imaging (n = 2737). Manually categorized imaging reports (Non-Negative if they could support the diagnosis of pneumonia, Negative otherwise; kappa = 0.88), served as a reference for the development of an automated report classifier through machine-learning [2]. EMR entries related to visits with Non-Negative chest imaging were manually reviewed to identify cases with Possible Pneumonia (new symptom(s) of cough, sputum, fever/chills/night sweats, dyspnea, pleuritic chest pain) or with Pneumonia-in-Plan (pneumonia listed as one of two most likely diagnoses in a physician’s note). These cases were used as reference for the development of the EMR-based CDAs. CDA components included ICD-9 codes for the full spectrum of ARI [1] or for the pneumonia subset, text analysis aimed at non-negated ARI symptoms in the clinical note [1] and the above-mentioned imaging report text classifier.
The manual review identified 370 reference cases with Possible Pneumonia and 250 with Pneumonia-in-Plan. Statistical performance for illustrative CDAs that combined structured EMR parameters with or without text analyses is shown in the Table. Addition of the “Text of Imaging Report” analyses increased PPV by 38–70% in absolute terms. Despite attendant losses in sensitivity, this classifier increased the F-measure of all CDAs based on a broad ARI ICD-9 codeset. With the possible exception of CDA 6, whose F-measure was the highest achieved in this study, the text analysis seeking ARI symptoms in the clinical note did not add further value to those CDAs that also included analyses of the chest imaging reports.
Automated text analysis of chest imaging reports can improve our ability to separate outpatients with pneumonia from those with a milder form of ARI.
PMCID: PMC3692922
situational awareness; influenza; surveillance; electronic medical record; pneumonia
16.  “Understanding” Medical School Curriculum Content Using KnowledgeMap 
Objective: To describe the development and evaluation of computational tools to identify concepts within medical curricular documents, using information derived from the National Library of Medicine's Unified Medical Language System (UMLS). The long-term goal of the KnowledgeMap (KM) project is to provide faculty and students with an improved ability to develop, review, and integrate components of the medical school curriculum.
Design: The KM concept identifier uses lexical resources partially derived from the UMLS (SPECIALIST lexicon and Metathesaurus), heuristic language processing techniques, and an empirical scoring algorithm. KM differentiates among potentially matching Metathesaurus concepts within a source document. The authors manually identified important “gold standard” biomedical concepts within selected medical school full-content lecture documents and used these documents to compare KM concept recognition with that of a known state-of-the-art “standard”—the National Library of Medicine's MetaMap program.
Measurements: The number of “gold standard” concepts in each lecture document identified by either KM or MetaMap, and the cause of each failure or relative success in a random subset of documents.
Results: For 4,281 “gold standard” concepts, MetaMap matched 78% and KM 82%. Precision for “gold standard” concepts was 85% for MetaMap and 89% for KM. The heuristics of KM accurately matched acronyms, concepts underspecified in the document, and ambiguous matches. The most frequent cause of matching failures was absence of target concepts from the UMLS Metathesaurus.
Conclusion: The prototypic KM system provided an encouraging rate of concept extraction for representative medical curricular texts. Future versions of KM should be evaluated for their ability to allow administrators, lecturers, and students to navigate through the medical curriculum to locate redundancies, find interrelated information, and identify omissions. In addition, the ability of KM to meet specific, personal information needs should be assessed.
PMCID: PMC181986  PMID: 12668688
17.  Information Extraction for Clinical Data Mining: A Mammography Case Study 
Breast cancer is the leading cause of cancer mortality in women between the ages of 15 and 54. During mammography screening, radiologists use a strict lexicon (BI-RADS) to describe and report their findings. Mammography records are then stored in a well-defined database format (NMD). Lately, researchers have applied data mining and machine learning techniques to these databases. They successfully built breast cancer classifiers that can help in early detection of malignancy. However, the validity of these models depends on the quality of the underlying databases. Unfortunately, most databases suffer from inconsistencies, missing data, inter-observer variability and inappropriate term usage. In addition, many databases are not compliant with the NMD format and/or solely consist of text reports. BI-RADS feature extraction from free text and consistency checks between recorded predictive variables and text reports are crucial to addressing this problem.
We describe a general scheme for concept information retrieval from free text given a lexicon, and present a BI-RADS features extraction algorithm for clinical data mining. It consists of a syntax analyzer, a concept finder and a negation detector. The syntax analyzer preprocesses the input into individual sentences. The concept finder uses a semantic grammar based on the BI-RADS lexicon and the experts’ input. It parses sentences detecting BI-RADS concepts. Once a concept is located, a lexical scanner checks for negation. Our method can handle multiple latent concepts within the text, filtering out ultrasound concepts. On our dataset, our algorithm achieves 97.7% precision, 95.5% recall and an F1-score of 0.97. It outperforms manual feature extraction at the 5% statistical significance level.
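A toy version of a concept finder with negation detection might look like the following. The sentence splitter, lexicon, and negation cues are simplified assumptions, not the authors' semantic grammar:

```python
import re

# A few common negation cues; a real system would use a richer set.
NEGATIONS = re.compile(r"\b(no|without|absence of|negative for)\b", re.I)

def find_concepts(report, lexicon):
    """Split the report into sentences, locate lexicon terms, and flag a
    concept as negated if a negation cue precedes it in its sentence."""
    hits = []
    for sentence in re.split(r"(?<=[.!?])\s+", report):
        for term in lexicon:
            m = re.search(re.escape(term), sentence, re.I)
            if m:
                negated = bool(NEGATIONS.search(sentence[:m.start()]))
                hits.append((term, negated))
    return hits

hits = find_concepts(
    "There is a spiculated mass. No calcification is seen.",
    ["mass", "calcification"],
)
```

Restricting the negation scan to the text before the concept, within the same sentence, is a crude approximation of the scoping rules a real negation detector applies.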
PMCID: PMC3676897  PMID: 23765123
BI-RADS; free text; lexicon; mammography; clinical data mining
18.  Generation of Silver Standard Concept Annotations from Biomedical Texts with Special Relevance to Phenotypes 
PLoS ONE  2015;10(1):e0116040.
Electronic health records and scientific articles possess differing linguistic characteristics that may impact the performance of natural language processing tools developed for one or the other. In this paper, we investigate the performance of four extant concept recognition tools: the clinical Text Analysis and Knowledge Extraction System (cTAKES), the National Center for Biomedical Ontology (NCBO) Annotator, the Biomedical Concept Annotation System (BeCAS) and MetaMap. Each of the four concept recognition systems is applied to four different corpora: the i2b2 corpus of clinical documents, a PubMed corpus of Medline abstracts, a clinical trials corpus and the ShARe/CLEF corpus. In addition, we assess the individual system performances with respect to one gold standard annotation set, available for the ShARe/CLEF corpus. Furthermore, we built a silver standard annotation set from the individual systems’ output and assessed its quality as well as the contribution of individual systems to it. Our results demonstrate that mainly the NCBO Annotator and cTAKES contribute to the silver standard corpora (F1-measures in the range of 21% to 74%) and their quality (best F1-measure of 33%), independent of the type of text investigated. While BeCAS and MetaMap can contribute to the precision of silver standard annotations (precision of up to 42%), the F1-measure drops when combined with NCBO Annotator and cTAKES due to a low recall. In conclusion, the performances of individual systems need to be improved independently of the text types, and the leveraging strategies to best take advantage of individual systems’ annotations need to be revised.
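A common way to build such a silver standard is voting over the systems' annotation sets; a minimal sketch, where the (span, concept) tuple encoding is an assumption:

```python
from collections import Counter

def silver_standard(system_outputs, min_votes=2):
    """Build a silver standard by voting: keep each (span, concept)
    annotation proposed by at least min_votes of the systems."""
    votes = Counter(ann for output in system_outputs for ann in set(output))
    return {ann for ann, v in votes.items() if v >= min_votes}

# Invented outputs from three hypothetical annotators
a = {("0-5", "Disease"), ("10-15", "Drug")}
b = {("0-5", "Disease")}
c = {("0-5", "Disease"), ("20-25", "Gene")}
silver = silver_standard([a, b, c])
```

Raising `min_votes` trades recall for precision, which mirrors the paper's observation that adding low-recall systems to the vote can drag the combined F1 down.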
The textual content of the PubMed corpus, the accession numbers for the clinical trials corpus, and the annotations assigned by the four concept recognition systems, as well as the generated silver standard annotation sets, are available online. The textual content of the ShARe/CLEF and i2b2 corpora needs to be requested from the individual corpus providers.
PMCID: PMC4301805  PMID: 25607983
19.  A Comparison of Cost Effectiveness Using Data from Randomized Trials or Actual Clinical Practice: Selective Cox-2 Inhibitors as an Example 
PLoS Medicine  2009;6(12):e1000194.
Tjeerd-Pieter van Staa and colleagues estimate the likely cost effectiveness of selective Cox-2 inhibitors prescribed during routine clinical practice, as compared to the cost effectiveness predicted from randomized controlled trial data.
Data on absolute risks of outcomes and patterns of drug use in cost-effectiveness analyses are often based on randomised clinical trials (RCTs). The objective of this study was to evaluate the external validity of published cost-effectiveness studies by comparing the data used in these studies (typically based on RCTs) to observational data from actual clinical practice. Selective Cox-2 inhibitors (coxibs) were used as an example.
Methods and Findings
The UK General Practice Research Database (GPRD) was used to estimate the exposure characteristics and individual probabilities of upper gastrointestinal (GI) events during current exposure to nonsteroidal anti-inflammatory drugs (NSAIDs) or coxibs. A basic cost-effectiveness model was developed evaluating two alternative strategies: prescription of a conventional NSAID or coxib. Outcomes included upper GI events as recorded in GPRD and hospitalisation for upper GI events recorded in the national registry of hospitalisations (Hospital Episode Statistics) linked to GPRD. Prescription costs were based on the prescribed number of tablets as recorded in GPRD and the 2006 cost data from the British National Formulary. The study population included over 1 million patients prescribed conventional NSAIDs or coxibs. Only a minority of patients used the drugs long-term and daily (34.5% of conventional NSAIDs and 44.2% of coxibs), whereas coxib RCTs required daily use for at least 6–9 months. The mean cost of preventing one upper GI event as recorded in GPRD was US$104k (ranging from US$64k with long-term daily use to US$182k with intermittent use) and US$298k for hospitalizations. The mean costs (for GPRD events) over calendar time were US$58k during 1990–1993 and US$174k during 2002–2005. Using RCT data rather than GPRD data for event probabilities, the mean cost was US$16k with the VIGOR RCT and US$20k with the CLASS RCT.
The published cost-effectiveness analyses of coxibs lacked external validity, did not represent patients in actual clinical practice, and should not have been used to inform prescribing policies. External validity should be an explicit requirement for cost-effectiveness analyses.
Please see later in the article for the Editors' Summary
Editors' Summary
Before a new treatment for a specific disease becomes an established part of clinical practice, it goes through a long process of development and clinical testing. This process starts with extensive studies of the new treatment in the laboratory and in animals and then moves into clinical trials. The most important of these trials are randomized controlled trials (RCTs), studies in which the efficacy and safety of the new drug and an established drug are compared by giving the two drugs to randomized groups of patients with the disease. The final hurdle that a drug or any other healthcare technology often has to jump before being adopted for widespread clinical use is a health technology assessment, which aims to provide policymakers, clinicians, and patients with information about the balance between the clinical and financial costs of the drug and its benefits (its cost-effectiveness). In England and Wales, for example, the National Institute for Health and Clinical Excellence (NICE), which promotes clinical excellence and the effective use of resources within the National Health Service, routinely commissions such assessments.
Why Was This Study Done?
Data on the risks of various outcomes associated with a new treatment are needed for cost-effectiveness analyses. These data are usually obtained from RCTs, but although RCTs are the best way of determining a drug's potency in experienced hands under ideal conditions (its efficacy), they may not be a good way to determine a drug's success in an average clinical setting (its effectiveness). In this study, the researchers compare the data from RCTs that have been used in several published cost-effectiveness analyses of a class of drugs called selective cyclooxygenase-2 inhibitors (“coxibs”) with observational data from actual clinical practice. They then ask whether the published cost-effectiveness studies, which generally used RCT data, should have been used to inform coxib prescribing policies. Coxibs are nonsteroidal anti-inflammatory drugs (NSAIDs) that were developed in the 1990s to treat arthritis and other chronic inflammatory conditions. Conventional NSAIDs can cause gastric ulcers and bleeding from the gut (upper gastrointestinal events) if taken for a long time. The use of coxibs avoids this problem.
What Did the Researchers Do and Find?
The researchers extracted data on the real-life use of conventional NSAIDs and coxibs and on the incidence of upper gastrointestinal events from the UK General Practice Research Database (GPRD) and from the national registry of hospitalizations. Only a minority of the million patients who were prescribed conventional NSAIDs (average cost per prescription US$17.80) or coxibs (average cost per prescription US$47.04) for a variety of inflammatory conditions took them on a long-term daily basis, whereas in the RCTs of coxibs, patients with a few carefully defined conditions took NSAIDs daily for at least 6–9 months. The researchers then developed a cost-effectiveness model to evaluate the costs of the alternative strategies of prescribing a conventional NSAID or a coxib. The mean additional cost of preventing one gastrointestinal event recorded in the GPRD by using a coxib instead of a NSAID, they report, was US$104,000; the mean cost of preventing one hospitalization for such an event was US$298,000. By contrast, the mean cost of preventing one gastrointestinal event by using a coxib instead of a NSAID calculated from data obtained in RCTs was about US$20,000.
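The arithmetic behind a cost-per-event-prevented figure is incremental cost divided by absolute risk reduction. In the sketch below, the per-prescription costs are those quoted above, while the per-prescription event risks are hypothetical, chosen only to show how a figure near US$104,000 can arise:

```python
def cost_per_event_prevented(cost_new, cost_old, risk_old, risk_new):
    """Incremental cost-effectiveness: extra cost per unit of care,
    divided by the absolute risk reduction it buys."""
    return (cost_new - cost_old) / (risk_old - risk_new)

# Costs from the summary: US$47.04 per coxib prescription vs US$17.80
# per NSAID prescription. The event risks per prescription below are
# hypothetical illustrations, not figures from the study.
cost = cost_per_event_prevented(47.04, 17.80, risk_old=0.0010, risk_new=0.00072)
```

The key sensitivity is the denominator: in intermittent real-world use the absolute risk reduction shrinks, so the cost per event prevented balloons relative to the RCT-based estimates.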
What Do These Findings Mean?
These findings suggest that the published cost-effectiveness analyses of coxibs greatly underestimate the cost of preventing gastrointestinal events by replacing prescriptions of conventional NSAIDs with prescriptions of coxibs. That is, if data from actual clinical practice had been used in cost-effectiveness analyses rather than data from RCTs, the conclusions of the published cost-effectiveness analyses of coxibs would have been radically different and may have led to different prescribing guidelines for this class of drug. More generally, these findings provide a good illustration of how important it is to ensure that cost-effectiveness analyses have “external” validity by using realistic estimates for event rates and costs rather than relying on data from RCTs that do not always reflect the real-world situation. The researchers suggest, therefore, that health technology assessments should move from evaluating cost-efficacy in ideal populations with ideal interventions to evaluating cost-effectiveness in real populations with real interventions.
Additional Information
Please access these Web sites via the online version of this summary at
The UK National Institute for Health Research provides information about health technology assessment
The National Institute for Health and Clinical Excellence Web site describes how this organization provides guidance on promoting good health within the England and Wales National Health Service
Information on the UK General Practice Research Database is available
Wikipedia has pages on health technology assessment and on selective cyclooxygenase-2 inhibitors (note that Wikipedia is a free online encyclopedia that anyone can edit; available in several languages)
PMCID: PMC2779340  PMID: 19997499
20.  Optimising the use of electronic health records to estimate the incidence of rheumatoid arthritis in primary care: what information is hidden in free text? 
Primary care databases are a major source of data for epidemiological and health services research. However, most studies are based on coded information, ignoring information stored in free text. Using the early presentation of rheumatoid arthritis (RA) as an exemplar, our objective was to estimate the extent of data hidden within free text, using a keyword search.
We examined the electronic health records (EHRs) of 6,387 patients from the UK, aged 30 years and older, with a first coded diagnosis of RA between 2005 and 2008. We listed indicators for RA which were present in coded format and ran keyword searches for similar information held in free text. The frequency of indicator code groups and keywords from one year before to 14 days after RA diagnosis were compared, and temporal relationships examined.
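A keyword search of this kind reduces to case-insensitive pattern matching over each patient's free-text entries; a minimal sketch with invented records:

```python
import re

def patients_with_keywords(notes_by_patient, keywords):
    """Return the IDs of patients with at least one keyword anywhere
    in their free text (matched case-insensitively)."""
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.I)
    return {pid for pid, notes in notes_by_patient.items()
            if any(pattern.search(note) for note in notes)}

# Invented example notes
notes = {
    1: ["Letter from rheumatology: Rheumatoid factor positive"],
    2: ["knee pain, NSAID prescribed"],
}
hits = patients_with_keywords(notes, ["rheumatoid factor", "synovitis"])
```

Escaping each keyword with `re.escape` keeps clinical terms containing punctuation from being misread as regex syntax.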
At least one keyword for RA was found in the free text of 29% of patients prior to the RA diagnostic code. Keywords for inflammatory arthritis diagnoses were present for 14% of patients, whereas only 11% had a diagnostic code. Codes for synovitis were found in 3% of patients, but keywords were identified in an additional 17%. In 13% of patients there was evidence of a positive rheumatoid factor test in text only, uncoded. No gender differences were found. Keywords generally occurred close in time to the coded diagnosis of rheumatoid arthritis. They were often found under codes indicating letters and communications.
Potential cases may be missed or wrongly dated when coded data alone are used to identify patients with RA, as diagnostic suspicions are frequently confined to text. The use of EHRs to create disease registers or assess quality of care will be misleading if free text information is not taken into account. Methods to facilitate the automated processing of text need to be developed and implemented.
PMCID: PMC3765394  PMID: 23964710
Electronic health records; Electronic medical records; Rheumatoid arthritis; Free text; Coding
21.  Extracting antipsychotic polypharmacy data from electronic health records: developing and evaluating a novel process 
BMC Psychiatry  2015;15:166.
Antipsychotic prescription information is commonly derived from structured fields in clinical health records. However, utilising diverse and comprehensive sources of information is especially important when investigating less frequent patterns of medication prescribing such as antipsychotic polypharmacy (APP). This study describes and evaluates a novel method of extracting APP data from both structured and free-text fields in electronic health records (EHRs), and its use for research purposes.
Using anonymised EHRs, we identified a cohort of patients with serious mental illness (SMI) who were treated in South London and Maudsley NHS Foundation Trust mental health care services between 1 January and 30 June 2012. Information about antipsychotic co-prescribing was extracted using a combination of natural language processing and a bespoke algorithm. The validity of the data derived through this process was assessed against a manually coded gold standard to establish precision and recall. Lastly, we estimated the prevalence and patterns of antipsychotic polypharmacy.
Individual instances of antipsychotic prescribing were detected with high precision (0.94 to 0.97) and moderate recall (0.57 to 0.77). We detected baseline APP (two or more antipsychotics prescribed in any 6-week window) with 0.92 precision and 0.74 recall, and long-term APP (antipsychotic co-prescribing for 6 months) with 0.94 precision and 0.60 recall. Of the 7,201 SMI patients receiving active care during the observation period, 338 (4.7 %; 95 % CI 4.2-5.2) were identified as receiving long-term APP. Co-prescription of two second-generation antipsychotics was most common (64.8 %), followed by combinations of first- and second-generation antipsychotics (32.5 %).
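The 6-week-window definition of APP can be checked with a simple scan over date-sorted prescriptions; a sketch, with the drug names purely illustrative:

```python
from datetime import date

def detect_app(prescriptions, window_days=42):
    """Flag antipsychotic polypharmacy: two or more distinct antipsychotics
    prescribed within any window (default 6 weeks = 42 days).
    prescriptions: list of (date, drug_name) tuples."""
    events = sorted(prescriptions)
    for i, (d1, drug1) in enumerate(events):
        for d2, drug2 in events[i + 1:]:
            if (d2 - d1).days > window_days:
                break  # events are sorted, so later ones are farther away
            if drug2 != drug1:
                return True
    return False

flagged = detect_app([(date(2012, 1, 1), "olanzapine"),
                      (date(2012, 2, 1), "risperidone")])
```

Sorting first lets the inner loop stop as soon as a prescription falls outside the window, so the scan stays fast even on large cohorts.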
These results suggest that this is a potentially practical tool for identifying polypharmacy from mental health EHRs on a large scale. Furthermore, extracted data can be used to allow researchers to characterize patterns of polypharmacy over time including different drug combinations, trends in polypharmacy prescribing, predictors of polypharmacy prescribing and the impact of polypharmacy on patient outcomes.
PMCID: PMC4511263  PMID: 26198696
Antipsychotic polypharmacy; Electronic health records; Precision; Recall
22.  Representation of Information about Family Relatives as Structured Data in Electronic Health Records 
Applied Clinical Informatics  2014;5(2):349-367.
The ability to manage and leverage family history information in the electronic health record (EHR) is crucial to delivering high-quality clinical care.
We aimed to evaluate existing standards in representing relative information, examine this information documented in EHRs, and develop a natural language processing (NLP) application to extract relative information from free-text clinical documents.
We reviewed a random sample of 100 admission notes and 100 discharge summaries of 198 patients, and also reviewed the structured entries for these patients in an EHR system’s family history module. We investigated the two standards used by Stage 2 of Meaningful Use (SNOMED CT and HL7 Family History Standard) and identified coverage gaps of each standard in coding relative information. Finally, we evaluated the performance of the MTERMS NLP system in identifying relative information from free-text documents.
The structure and content of SNOMED CT and HL7 for representing relative information differ in several ways. Both terminologies have high coverage of the local relative concepts built into an ambulatory EHR system, but gaps in key concept coverage were detected; coverage rates for relative information in free-text clinical documents were 95.2% (SNOMED CT) and 98.6% (HL7). Compared to structured entries, richer family history information was only available in free-text documents. Using a comprehensive lexicon that included concepts and terms of relative information from different sources, we expanded the MTERMS NLP system to extract and encode relative information in clinical documents, achieving a precision of 100% and a recall of 97.4%.
Comprehensive assessment and user guidance are critical to adopting standards into EHR systems in a meaningful way. A significant portion of patients’ family history information is only documented in free-text clinical documents and NLP can be used to extract this information.
PMCID: PMC4081741  PMID: 25024754
Terminology; SNOMED; HL7; family history; electronic health records; natural language processing
23.  Using Electronic Patient Records to Discover Disease Correlations and Stratify Patient Cohorts 
PLoS Computational Biology  2011;7(8):e1002141.
Electronic patient records remain a rather unexplored, but potentially rich data source for discovering correlations between diseases. We describe a general approach for gathering phenotypic descriptions of patients from medical records in a systematic and non-cohort dependent manner. By extracting phenotype information from the free-text in such records we demonstrate that we can extend the information contained in the structured record data, and use it for producing fine-grained patient stratification and disease co-occurrence statistics. The approach uses a dictionary based on the International Classification of Disease ontology and is therefore in principle language independent. As a use case we show how records from a Danish psychiatric hospital lead to the identification of disease correlations, which subsequently can be mapped to systems biology frameworks.
Author Summary
Text mining and information extraction can be seen as the challenge of converting information hidden in text into manageable data. We have used text mining to automatically extract clinically relevant terms from 5543 psychiatric patient records and map these to disease codes in the International Classification of Disease ontology (ICD10). Mined codes were supplemented by existing coded data. For each patient we constructed a phenotypic profile of associated ICD10 codes. This allowed us to cluster patients together based on the similarity of their profiles. The result is a patient stratification based on more complete profiles than the primary diagnosis, which is typically used. Similarly, we investigated comorbidities by looking for pairs of disease codes co-occurring in patients more often than expected. Our high-ranking pairs were manually curated by a medical doctor, who flagged 93 candidates as interesting. For a number of these we were able to find genes/proteins known to be associated with the diseases using the OMIM database. The disease-associated proteins allowed us to construct protein networks suspected to be involved in each of the phenotypes. Shared proteins between two associated diseases might provide insight into the disease comorbidity.
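The comorbidity idea above (code pairs co-occurring more often than expected) can be sketched as an observed-versus-expected ratio under an independence assumption. This is a minimal illustration, not the paper's method; a real analysis would add a significance test (e.g., Fisher's exact test), and the patient profiles and codes below are invented:

```python
# Hedged sketch: for a pair of ICD10 codes, compare the observed number
# of patients carrying both codes with the count expected if the codes
# were assigned independently.
def cooccurrence_ratio(profiles, code_a, code_b):
    """profiles: list of per-patient sets of ICD10 codes."""
    n = len(profiles)
    has_a = sum(code_a in p for p in profiles)
    has_b = sum(code_b in p for p in profiles)
    both = sum(code_a in p and code_b in p for p in profiles)
    expected = has_a * has_b / n  # expected co-occurrences under independence
    return both / expected if expected else float("inf")

# Hypothetical cohort of eight patients.
patients = [
    {"F20", "F41"}, {"F20", "F41"}, {"F20", "F41"}, {"F20"},
    {"F41"}, {"F32"}, {"F32"}, {"F33"},
]
ratio = cooccurrence_ratio(patients, "F20", "F41")  # 3 observed vs 2 expected
```

A ratio well above 1 flags a candidate comorbidity pair for manual curation, as in the study's doctor-reviewed candidate list.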
PMCID: PMC3161904  PMID: 21901084
24.  Use of the General Practice Research Database (GPRD) for respiratory epidemiology: a comparison with the 4th Morbidity Survey in General Practice (MSGP4) 
Thorax  1999;54(5):413-419.
BACKGROUND—The General Practice Research Database (GPRD) covers over 6% of the population of England and Wales and holds data on diagnoses and prescribing from 1987 onwards. Most previous studies using the GPRD have concentrated on drug use and safety. A study was undertaken to assess the validity of using the GPRD for epidemiological research into respiratory diseases.
METHODS—Age-specific and sex-specific rates derived from the GPRD for 11 respiratory conditions were compared with patient consultation rates from the 4th Morbidity Survey in General Practice (MSGP4). Within the GPRD comparisons were made between patient diagnosis rates, patient prescription rates, and patient "prescription plus relevant diagnosis" rates for selected treatments.
RESULTS—There was good agreement between consultation rates in the MSGP4 and diagnosis or "prescription plus diagnosis" from the GPRD in terms of pattern and magnitude, except for "acute bronchitis or bronchiolitis" where the best comparison was the combination category of "chest infection" and/or "acute bronchitis or bronchiolitis". Within the GPRD, patient prescription rates for inhalers, tuberculosis or hayfever therapy showed little similarity with diagnosis only rates but a similarity was seen with the combination of "prescription plus diagnosis" which may be a better reflection of morbidity than diagnosis alone.
CONCLUSIONS—The GPRD appears to be valid for primary care epidemiological studies by comparison with MSGP4 and offers advantages in terms of large size, a longer time period covered, and ability to link prescriptions with diagnoses. However, careful interpretation is needed because not all consultations are recorded and the coding system used contains terms which do not directly map to ICD codes.

PMCID: PMC1763769  PMID: 10212105
25.  Assessment of disease named entity recognition on a corpus of annotated sentences 
BMC Bioinformatics  2008;9(Suppl 3):S3.
In recent years, the recognition of semantic types from the biomedical scientific literature has focused on named entities like protein and gene names (PGNs) and gene ontology terms (GO terms). Other semantic types, such as diseases, have not received the same level of attention. Different solutions have been proposed to identify disease named entities in the scientific literature. While matching the terminology with language patterns suffers from low recall (e.g., Whatizit), other solutions make use of morpho-syntactic features to better cover the full scope of terminological variability (e.g., MetaMap). MetaMap, provided by the National Library of Medicine (NLM), is currently the state-of-the-art solution for annotating UMLS (Unified Medical Language System) concepts in the literature. Nonetheless, its performance has not yet been assessed on an annotated corpus. In addition, little effort has been invested so far to generate an annotated dataset that links disease entities in text to disease entries in a database, thesaurus or ontology and that could serve as a gold standard to benchmark text mining solutions.
As part of our research work, we have taken a corpus that had been delivered in the past for the identification of associations of genes to diseases based on the UMLS Metathesaurus, and we have reprocessed and re-annotated it. We gathered annotations for disease entities from two curators, measured their inter-annotator agreement (kappa statistic of 0.51), and composed a single annotated corpus for public use. Thereafter, three solutions for disease named entity recognition, including MetaMap, were applied to the corpus to automatically annotate it with UMLS Metathesaurus concepts. The resulting annotations have been benchmarked to compare their performance.
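The kappa statistic cited above corrects raw inter-annotator agreement for the agreement expected by chance. As a minimal sketch (the annotator labels below are invented, and real studies typically use a library implementation such as scikit-learn's `cohen_kappa_score`):

```python
# Illustrative Cohen's kappa for two annotators labelling the same items.
from collections import Counter

def cohens_kappa(a, b):
    """a, b: equal-length lists of categorical labels from two annotators."""
    assert len(a) == len(b)
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n      # raw agreement
    ca, cb = Counter(a), Counter(b)
    # Chance agreement: product of each annotator's marginal label rates.
    expected = sum(ca[label] * cb[label] for label in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical labels for six candidate disease mentions.
ann1 = ["disease", "disease", "other", "disease", "other", "other"]
ann2 = ["disease", "other", "other", "disease", "other", "disease"]
k = cohens_kappa(ann1, ann2)  # 0.667 raw agreement, 0.5 by chance -> kappa 1/3
```

A kappa of 0.51, as reported for the two curators, indicates moderate agreement, which motivated reconciling the annotations into a single corpus.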
The annotated corpus is publicly available and can serve as a benchmark for other systems. In addition, we found that dictionary look-up already provides competitive results, indicating that disease terminology is highly standardized throughout the terminologies and the literature. MetaMap generates precise results at the expense of insufficient recall, while our statistical method obtains better recall at a lower precision rate. Even better results in terms of precision are achieved by combining at least two of the three methods, but this approach again lowers recall. Altogether, our analysis gives a better understanding of the complexity of disease annotations in the literature. MetaMap and the dictionary-based approach are available through the Whatizit web service infrastructure (Rebholz-Schuhmann D, Arregui M, Gaudan S, Kirsch H, Jimeno A: Text processing through Web services: Calling Whatizit. Bioinformatics 2008, 24:296-298).
PMCID: PMC2352871  PMID: 18426548