1.  Sentiment Analysis of Suicide Notes: A Shared Task 
Biomedical informatics insights  2012;5(Suppl 1):3-16.
This paper reports on a shared task involving the assignment of emotions to suicide notes. Two features distinguished this task from previous shared tasks in the biomedical domain. One is that it resulted in a corpus of fully anonymized clinical text and annotated suicide notes. This resource is permanently available and will (we hope) facilitate future research. The other key feature of the task is that it required categorization with respect to a large set of labels. The number of participants was larger than in any previous biomedical challenge task. We describe the data production process and the evaluation measures, and give a preliminary analysis of the results. Many systems performed at levels approaching the inter-coder agreement, suggesting that human-like performance on this task is within the reach of currently available technologies.
PMCID: PMC3299408  PMID: 22419877
Sentiment analysis; suicide; suicide notes; natural language processing; computational linguistics; shared task; challenge 2011
2.  Comparison of LDA and SPRT on Clinical Dataset Classifications 
In this work, we investigate the well-known classification algorithm LDA as well as its close relative SPRT. SPRT affords many theoretical advantages over LDA. It allows specification of desired classification error rates α and β and is expected to be faster in predicting the class label of a new instance. However, SPRT is not as widely used as LDA in the pattern recognition and machine learning communities. For this reason, we investigate LDA, SPRT and a modified SPRT (MSPRT) empirically using clinical datasets from Parkinson’s disease, colon cancer, and breast cancer. We make the same normality assumption as LDA and propose variants of the two SPRT algorithms based on the order in which the components of an instance are sampled. Leave-one-out cross-validation is used to assess and compare the performance of the methods. The results indicate that two variants, SPRT-ordered and MSPRT-ordered, are superior to LDA in terms of prediction accuracy. Moreover, on average SPRT-ordered and MSPRT-ordered examine fewer components than LDA before arriving at a decision. These advantages imply that SPRT-ordered and MSPRT-ordered are preferable to LDA when the normality assumption can be justified for a dataset.
PMCID: PMC3178328  PMID: 21949476
clinical data classification; linear discriminant analysis; sequential probability ratio test; supervised learning
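The sequential decision rule described in the abstract above can be illustrated with a minimal sketch of Wald's classical SPRT for two classes. This is not the authors' implementation: the function name `sprt_classify`, the assumption of independent Gaussian components with a per-component standard deviation shared by both classes, and the fallback rule used when no threshold is crossed are all assumptions made for illustration. The thresholds follow Wald's standard approximations in terms of the error rates α and β.

```python
import math


def sprt_classify(x, mu0, mu1, sigma, alpha=0.05, beta=0.05):
    """Wald SPRT between class 0 and class 1 (illustrative sketch).

    Assumptions (not taken from the paper): components are independent,
    each is Gaussian, and both classes share the same per-component
    standard deviation sigma[i].

    Returns (predicted_label, number_of_components_examined).
    """
    upper = math.log((1 - beta) / alpha)  # decide class 1 at or above this
    lower = math.log(beta / (1 - alpha))  # decide class 0 at or below this

    llr = 0.0  # running log-likelihood ratio, class 1 versus class 0
    for i, (xi, m0, m1, s) in enumerate(zip(x, mu0, mu1, sigma), start=1):
        # log N(xi; m1, s) - log N(xi; m0, s) when both share the same s
        llr += ((xi - m0) ** 2 - (xi - m1) ** 2) / (2 * s ** 2)
        if llr >= upper:
            return 1, i
        if llr <= lower:
            return 0, i

    # Hypothetical fallback: all components used without crossing a
    # threshold, so decide by the sign of the accumulated ratio.
    return (1 if llr > 0 else 0), len(x)


# An instance sitting on the class-1 means is decided before the
# last component is examined.
label, n_used = sprt_classify([2.0, 2.0, 2.0],
                              mu0=[0.0, 0.0, 0.0],
                              mu1=[2.0, 2.0, 2.0],
                              sigma=[1.0, 1.0, 1.0])
print(label, n_used)
```

The early return is what lets a sequential test examine fewer components than a rule that always uses the full instance; the order in which components are passed in is left to the caller here.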
3.  Suicide Note Classification Using Natural Language Processing: A Content Analysis 
Biomedical informatics insights  2010;2010(3):19-28.
Suicide is the second leading cause of death among 25–34 year olds and the third leading cause of death among 15–25 year olds in the United States. In the Emergency Department, where suicidal patients often present, estimating the risk of repeated attempts is generally left to clinical judgment. This paper presents our second attempt to determine the role of computational algorithms in understanding a suicidal patient’s thoughts, as represented by suicide notes. We focus on developing methods of natural language processing that distinguish between genuine and elicited suicide notes. We hypothesize that machine learning algorithms can categorize suicide notes as well as mental health professionals and psychiatric physician trainees do. The data comprise suicide notes from 33 suicide completers, matched to 33 elicited notes from healthy control group members. Eleven mental health professionals and 31 psychiatric trainees were asked to decide whether a note was genuine or elicited. Their decisions were compared to those of nine different machine learning algorithms. The results indicate that trainees accurately classified notes 49% of the time, mental health professionals accurately classified notes 63% of the time, and the best machine learning algorithm accurately classified the notes 78% of the time. This is an important step in developing an evidence-based predictor of repeated suicide attempts because it shows that natural language processing can aid in distinguishing between classes of suicide notes.
PMCID: PMC3107011  PMID: 21643548
suicide; suicide prediction; suicide notes; machine learning
4.  A Method to Detect Differential Gene expression in Cross-Species Hybridization Experiments at Gene and Probe Level 
Biomedical informatics insights  2010;2010(3):1-10.
Whole genome microarrays are increasingly becoming the method of choice to study responses in model organisms to disease, stressors or other stimuli. However, whole genome sequences are available for only some model organisms, and there are still many species whose genome sequences are not yet available. Cross-species studies, where arrays developed for one species are used to study gene expression in a closely related species, have been used to address this gap, with some promising results. Current analytical methods have included filtration of some probes or genes that showed low hybridization activities. But consensus filtration schemes are still not available.
We propose a novel masking procedure, based on currently available target-species sequences, to filter out probes, and we study a cross-species data set using this masking procedure together with gene-set analysis. Gene-set analysis evaluates the association of a priori defined gene groups with a phenotype of interest. Two methods, Gene Set Enrichment Analysis (GSEA) and Test of Test Statistics (ToTS), were investigated. The results showed that the masking procedure together with the ToTS method worked well in our data set. The results from an alternative way to study cross-species hybridization experiments without masking are also presented. We hypothesize that the multi-probe structure of Affymetrix microarrays makes it possible to aggregate the effects of both well-hybridized and poorly hybridized probes to study a group of genes. The principles of gene-set analysis were applied to the probe-level data instead of gene-level data. The results showed that ToTS can give valuable information and thus can be used as a powerful technique for analyzing cross-species hybridization experiments.
Software in the form of R code is available at
PMCID: PMC2928260  PMID: 20798791
gene expression; cross-species; probes; genes; hybridization
5.  Current Challenge in Consumer Health Informatics: Bridging the Gap between Access to Information and Information Understanding 
The number of health-related websites has proliferated over the past few years. Health information consumers confront a myriad of health-related resources on the internet that vary in quality and are not always easy to comprehend. There is thus a need to help health information consumers bridge the gap between access to information and information understanding, i.e. to help consumers understand health-related web-based resources so that they can act upon them. At the same time, health information consumers are becoming not only more involved in their own health care but also more information technology minded. One way to address this issue is to provide consumers with tailored information that is contextualized and personalized, i.e. directly relevant to the person's own health situation and easily comprehensible. This paper presents a current trend in Consumer Health Informatics which focuses on theory-based design and development of contextualized and personalized tools that allow the evolving consumer, with varying backgrounds and interests, to use online health information efficiently. The proposed approach uses a theoretical framework of communication to support the consumer's capacity to understand health-related web-based resources.
PMCID: PMC2858407  PMID: 20419038
consumer health information; internet; contextualization of information; personalization
6.  Avoiding Pitfalls in the Statistical Analysis of Heterogeneous Tumors 
Information about tumors is usually obtained from a single assessment of a tumor sample, performed at some point in the course of the development and progression of the tumor, with patient characteristics being surrogates for natural history context. Differences between cells within individual tumors (intratumor heterogeneity) and between tumors of different patients (intertumor heterogeneity) may mean that a small sample is not representative of the tumor as a whole, particularly for solid tumors which are the focus of this paper. This issue is of increasing importance as high-throughput technologies generate large multi-feature data sets in the areas of genomics, proteomics, and image analysis. Three potential pitfalls in statistical analysis are discussed (sampling, cut-points, and validation) and suggestions are made about how to avoid these pitfalls.
PMCID: PMC2828739  PMID: 20191105
cancer; statistics; biomarkers; prognosis; heterogeneity
7.  Personalizing Drug Selection Using Advanced Clinical Decision Support 
This article describes the process of developing an advanced pharmacogenetics clinical decision support system at one of the United States’ leading pediatric academic medical centers. This system, called CHRISTINE, combines clinical and genetic data to identify the optimal drug therapy when treating patients with epilepsy or Attention Deficit Hyperactivity Disorder. The discussion provides a description of clinical decision support systems, along with an overview of neurocognitive computing and how it is applied in this setting.
PMCID: PMC2773552  PMID: 19898682

