
Results 1-5 (5)

1.  Automatically Annotating Topics in Transcripts of Patient-Provider Interactions via Machine Learning 
Annotated patient-provider encounters can provide important insights into clinical communication, ultimately suggesting how it might be improved to effect better health outcomes. But annotating outpatient transcripts with Roter or General Medical Interaction Analysis System (GMIAS) codes is expensive, limiting the scope of such analyses. We propose automatically annotating transcripts of patient-provider interactions with topic codes via machine learning.
We use a conditional random field (CRF) to model utterance topic probabilities. The model accounts for the sequential structure of conversations and the words comprising utterances. We assess predictive performance via 10-fold cross-validation over GMIAS-annotated transcripts of 360 outpatient visits (over 230,000 utterances). We then use automated in place of manual annotations to reproduce an analysis of 116 additional visits from a randomized trial that used GMIAS to assess the efficacy of an intervention aimed at improving communication around antiretroviral (ARV) adherence.
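The value of modeling sequential structure, as the CRF does, can be illustrated with a minimal Viterbi decoder in plain Python. The topic labels, emission scores, and "stay on topic" transition bonus below are invented for illustration; they are not the paper's learned model or its actual GMIAS topic codes.

```python
import math

TOPICS = ["biomedical", "logistics", "socioemotional"]

def viterbi(emissions, transitions):
    """emissions: list of {topic: score} dicts, one per utterance;
    transitions: {(prev_topic, cur_topic): score}.
    Returns the highest-scoring topic sequence."""
    n = len(emissions)
    best = [{t: emissions[0].get(t, -math.inf) for t in TOPICS}]
    back = [{}]
    for i in range(1, n):
        scores, ptrs = {}, {}
        for cur in TOPICS:
            # best predecessor = max over prior topic of (path score + transition)
            prev = max(TOPICS,
                       key=lambda p: best[i - 1][p] + transitions.get((p, cur), 0.0))
            scores[cur] = (best[i - 1][prev]
                           + transitions.get((prev, cur), 0.0)
                           + emissions[i].get(cur, -math.inf))
            ptrs[cur] = prev
        best.append(scores)
        back.append(ptrs)
    # trace back from the best final topic
    last = max(TOPICS, key=lambda t: best[-1][t])
    path = [last]
    for i in range(n - 1, 1 - 1, -1):
        if i == 0:
            break
        path.append(back[i][path[-1]])
    return list(reversed(path))

if __name__ == "__main__":
    # The second utterance is ambiguous on word scores alone; the
    # transition bonus for staying on topic resolves it.
    ems = [{"biomedical": 2.0, "logistics": 0.1, "socioemotional": 0.1},
           {"biomedical": 1.0, "logistics": 1.0, "socioemotional": 0.2}]
    trans = {("biomedical", "biomedical"): 0.5}
    print(viterbi(ems, trans))  # → ['biomedical', 'biomedical']
```

A real CRF learns these scores as weighted feature functions over words and adjacent labels; the decoding step, however, is exactly this dynamic program.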
With respect to six topic codes, the CRF achieved a mean pairwise kappa compared with human annotators of 0.49 (range: 0.47, 0.53) and a mean overall accuracy of 0.64 (range: 0.62, 0.66). With respect to the RCT re-analysis, results using automated annotations agreed with those obtained using manual ones. According to the manual annotations, the median number of ARV-related utterances without and with the intervention was 49.5 versus 76, respectively (paired sign test p=0.07). Using automated annotations, the respective numbers were 39 versus 55 (p=0.04).
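The two statistics reported above are standard and easy to reproduce. Below is a small pure-Python sketch of Cohen's kappa (agreement corrected for chance) and the exact two-sided paired sign test; this is a generic reimplementation on toy inputs, not the paper's evaluation code.

```python
from math import comb

def cohens_kappa(a, b):
    """Cohen's kappa between two annotators' label sequences."""
    assert len(a) == len(b) and a
    n = len(a)
    labels = set(a) | set(b)
    observed = sum(x == y for x, y in zip(a, b)) / n
    # chance agreement from each annotator's marginal label frequencies
    expected = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (observed - expected) / (1 - expected)

def sign_test_p(pairs):
    """Exact two-sided paired sign test; tied pairs are dropped."""
    diffs = [x - y for x, y in pairs if x != y]
    n = len(diffs)
    k = sum(d > 0 for d in diffs)
    tail = min(k, n - k)
    p = 2 * sum(comb(n, i) for i in range(tail + 1)) / 2 ** n
    return min(p, 1.0)
```

For example, `cohens_kappa(["A","A","B","B"], ["A","B","B","B"])` gives 0.5: observed agreement is 0.75, chance agreement 0.5.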
While moderately accurate, the predicted annotations are far from perfect. Moreover, conversational topics are intermediate outcomes, and their utility is still being researched.
This foray into automated topic inference suggests that machine learning methods can classify utterances comprising patient-provider interactions into clinically relevant topics with reasonable accuracy.
PMCID: PMC3991772  PMID: 24285151
2.  The COPD genetic association compendium: a comprehensive online database of COPD genetic associations 
Human Molecular Genetics  2009;19(3):526-534.
Chronic obstructive pulmonary disease (COPD) is a major cause of morbidity and mortality worldwide. COPD is thought to arise from the interaction of environmental exposures and genetic susceptibility, and major research efforts are underway to identify genetic determinants of COPD susceptibility. With the exception of SERPINA1, genetic associations with COPD identified by candidate gene studies have been inconsistently replicated, and this literature is difficult to interpret. We conducted a systematic review and meta-analysis of all population-based, case–control candidate gene COPD studies indexed in PubMed before 16 July 2008. We stored our findings in an online database, which serves as an up-to-date compendium of COPD genetic associations and cumulative meta-analysis estimates. On the basis of our systematic review, the vast majority of COPD candidate gene era studies are underpowered to detect genetic effect odds ratios of 1.2–1.5. We identified 27 genetic variants with adequate data for quantitative meta-analysis. Of these variants, four were significantly associated with COPD susceptibility in random effects meta-analysis: the GSTM1 null variant (OR 1.45, CI 1.09–1.92), rs1800470 in TGFB1 (OR 0.73, CI 0.64–0.83), rs1800629 in TNF (OR 1.19, CI 1.01–1.40), and rs1799896 in SOD3 (OR 1.97, CI 1.24–3.13). In summary, most COPD candidate gene era studies are underpowered to detect moderate-sized genetic effects. Quantitative meta-analysis identified four variants in GSTM1, TGFB1, TNF and SOD3 that show statistically significant evidence of association with COPD susceptibility.
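The random-effects pooling behind estimates like those above is commonly the DerSimonian–Laird method. The sketch below pools study odds ratios whose standard errors are recovered from their 95% confidence intervals; the input numbers are invented, not the COPD studies' data.

```python
import math

def dersimonian_laird(odds_ratios, ci_lowers, ci_uppers):
    """DerSimonian-Laird random-effects pooling of odds ratios.
    Returns (pooled OR, 95% CI lower, 95% CI upper)."""
    y = [math.log(o) for o in odds_ratios]                  # log odds ratios
    se = [(math.log(u) - math.log(l)) / (2 * 1.96)          # SE from 95% CI width
          for l, u in zip(ci_lowers, ci_uppers)]
    w = [1 / s ** 2 for s in se]                            # fixed-effect weights
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    q = sum(wi * (yi - ybar) ** 2 for wi, yi in zip(w, y))  # Cochran's Q
    df = len(y) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)                           # between-study variance
    w_re = [1 / (s ** 2 + tau2) for s in se]                # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_re, y)) / sum(w_re)
    se_pooled = math.sqrt(1 / sum(w_re))
    return (math.exp(pooled),
            math.exp(pooled - 1.96 * se_pooled),
            math.exp(pooled + 1.96 * se_pooled))
```

When the between-study heterogeneity estimate tau² is zero, the result reduces to the fixed-effect inverse-variance pooled estimate.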
PMCID: PMC2798725  PMID: 19933216
3.  Semi-automated screening of biomedical citations for systematic reviews 
BMC Bioinformatics  2010;11:55.
Systematic reviews address a specific clinical question by assessing and analyzing the pertinent literature in an unbiased manner. Citation screening is a time-consuming and critical step in systematic reviews. Typically, reviewers must evaluate thousands of citations to identify articles eligible for a given review. We explore the application of machine learning techniques to semi-automate citation screening, thereby reducing the reviewers' workload.
We present a novel online classification strategy for citation screening to automatically discriminate "relevant" from "irrelevant" citations. We use an ensemble of Support Vector Machines (SVMs) built over different feature-spaces (e.g., abstract and title text), and trained interactively by the reviewer(s).
Semi-automating the citation screening process is difficult because any such strategy must identify all citations eligible for the systematic review. This requirement is made harder still due to class imbalance; there are far fewer "relevant" than "irrelevant" citations for any given systematic review. To address these challenges we employ a custom active-learning strategy developed specifically for imbalanced datasets. Further, we introduce a novel undersampling technique. We provide experimental results over three real-world systematic review datasets, and demonstrate that our algorithm is able to reduce the number of citations that must be screened manually by nearly half in two of these, and by around 40% in the third, without excluding any of the citations eligible for the systematic review.
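The two ingredients named above can be sketched generically: an undersampling step that balances the training pool, and an uncertainty-sampling round that picks the citations the current model is least sure about for the reviewer to label next. These are textbook versions for illustration, not the paper's specific novel undersampling technique or its SVM ensemble; `label_of` and `score` are hypothetical stand-ins for the labeled data and the current classifier.

```python
import random

def undersample(citations, label_of, ratio=1.0, seed=0):
    """Keep every minority-class ("relevant") citation; randomly sample
    the majority ("irrelevant") class down to ratio * len(minority)."""
    rng = random.Random(seed)
    relevant = [c for c in citations if label_of[c] == "relevant"]
    irrelevant = [c for c in citations if label_of[c] == "irrelevant"]
    k = min(len(irrelevant), int(ratio * len(relevant)))
    return relevant + rng.sample(irrelevant, k)

def active_learning_round(unlabeled, score, batch=5):
    """Uncertainty sampling: surface the citations whose predicted
    probability of relevance is closest to the 0.5 decision boundary."""
    return sorted(unlabeled, key=lambda c: abs(score(c) - 0.5))[:batch]
```

In a screening loop, the reviewer labels each batch returned by `active_learning_round`, the model is retrained on the (undersampled) labeled pool, and the cycle repeats until the model's predictions on the remainder are trusted.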
We have developed a semi-automated citation screening algorithm for systematic reviews that has the potential to substantially reduce the number of citations reviewers have to manually screen, without compromising the quality and comprehensiveness of the review.
PMCID: PMC2824679  PMID: 20102628
4.  Meta-Analyst: software for meta-analysis of binary, continuous and diagnostic data 
Meta-analysis is increasingly used as a key source of evidence synthesis to inform clinical practice. The theory and statistical foundations of meta-analysis continually evolve, providing solutions to many new and challenging problems. In practice, most meta-analyses are performed in general statistical packages or dedicated meta-analysis programs.
Herein, we introduce Meta-Analyst, a novel, powerful, intuitive, and free program for a variety of meta-analysis problems. Meta-Analyst is implemented in C# atop the Microsoft .NET Framework, and features a graphical user interface. The software performs several meta-analysis and meta-regression models for binary and continuous outcomes, as well as analyses for diagnostic and prognostic test studies in the frequentist and Bayesian frameworks. Moreover, Meta-Analyst includes a flexible tool to edit and customize generated meta-analysis graphs (e.g., forest plots) and provides output in many formats (images, Adobe PDF, Microsoft Word-ready RTF). The software architecture employed allows rapid changes to be made to either the graphical user interface (GUI) or the analytic modules.
We verified the numerical precision of Meta-Analyst by comparing its output with that from standard meta-analysis routines in Stata over a large database of 11,803 meta-analyses of binary outcome data, and 6,881 meta-analyses of continuous outcome data from the Cochrane Library of Systematic Reviews. Results from analyses of diagnostic and prognostic test studies have been verified in a limited number of meta-analyses versus MetaDisc and MetaTest. Bayesian statistical analyses use the OpenBUGS calculation engine (and are thus as accurate as the standalone OpenBUGS software).
We have developed and validated a new program for conducting meta-analyses that combines the advantages of existing software for this task.
PMCID: PMC2795760  PMID: 19961608
5.  Toward modernizing the systematic review pipeline in genetics: efficient updating via data mining 
Genetics in Medicine  2012;14(7):663-669.
The aim of this study was to demonstrate that modern data mining tools can be used as one step in reducing the labor necessary to produce and maintain systematic reviews.
We used four continuously updated, manually curated resources that summarize MEDLINE-indexed articles in entire fields using systematic review methods (PDGene, AlzGene, and SzGene for genetic determinants of Parkinson disease, Alzheimer disease, and schizophrenia, respectively; and the Tufts Cost-Effectiveness Analysis (CEA) Registry for cost-effectiveness analyses). In each data set, we trained a classification model on citations screened up until 2009. We then evaluated the ability of the model to classify citations published in 2010 as “relevant” or “irrelevant” using human screening as the gold standard.
Classification models did not miss any of the 104, 65, and 179 eligible citations in PDGene, AlzGene, and SzGene, respectively, and missed only 1 of 79 in the CEA Registry (100% sensitivity for the first three and 99% for the fourth). The respective specificities were 90, 93, 90, and 73%. Had the semiautomated system been used in 2010, a human would have needed to read only 605/5,616 citations to update the PDGene registry (11%) and 555/7,298 (8%), 717/5,381 (13%), and 334/1,015 (33%) for the other three databases.
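The reported figures follow directly from a screening confusion matrix. The sketch below computes sensitivity, specificity, and the fraction of citations a human still reads (everything the classifier flags as potentially relevant); the PDGene counts in the usage line are derived arithmetically from the abstract (104 eligible citations all retrieved, 605 flagged out of 5,616, so fp = 605 − 104 = 501 and tn = 5,616 − 605 = 5,011).

```python
def screening_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, and residual human workload for a
    semi-automated citation screening run.

    tp: eligible citations flagged relevant   fn: eligible citations missed
    tn: ineligible citations filtered out     fp: ineligible citations flagged
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    workload = (tp + fp) / (tp + fn + tn + fp)  # share a human must still read
    return sensitivity, specificity, workload
```

For the PDGene-derived counts, `screening_metrics(104, 0, 5011, 501)` yields 100% sensitivity, roughly 90% specificity, and a residual workload of about 11% (605/5,616), matching the abstract.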
Data mining methodologies can reduce the burden of updating systematic reviews without sacrificing the sensitivity of human screening.
PMCID: PMC3908550  PMID: 22481134
citation screening; machine learning; meta-analysis; support vector machine; text classification
