Clinical records include both coded and free-text fields that interact to reflect complicated patient stories. The information often covers not only the present medical condition and events experienced by the patient, but also refers to relevant events in the past (such as signs, symptoms, tests or treatments). In order to automatically construct a timeline of these events, we first need to extract the temporal relations between pairs of events or time expressions presented in the clinical notes. We designed separate extraction components for different types of temporal relations, utilizing a novel hybrid system that combines machine learning with a graph-based inference mechanism to extract the temporal links. The temporal graph is a directed graph based on parse tree dependencies of the simplified sentences and frequent pattern clues. We generalized the sentences in order to discover patterns that, given the complexities of natural language, might not be directly discoverable in the original sentences. The proposed hybrid system reached an F-measure of 0.63, with precision at 0.76 and recall at 0.54, on the 2012 i2b2 Natural Language Processing corpus for the temporal relation (TLink) extraction task, achieving the highest precision and third-highest F-measure among participating teams in the TLink track.
Temporal relation extraction; Clinical text mining; Automatic patient timeline; Natural Language Processing; Machine learning; Temporal graph
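The F-measure reported for the TLink task can be checked against the stated precision and recall. The following is a minimal sketch, assuming the balanced F-measure (harmonic mean of precision and recall); the function name `f_measure` is illustrative and not taken from the original system.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; beta = 1 gives the harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall as quoted in the abstract above.
print(round(f_measure(0.76, 0.54), 2))  # prints 0.63
```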
Finding gene functions discussed in the literature is an important task of information extraction (IE) from biomedical documents. Automated computational methodologies can significantly reduce the need for manual curation and improve quality of other related IE systems. We propose an open-IE method for the BioCreative IV GO shared task (subtask b), focused on finding gene function terms [Gene Ontology (GO) terms] for different genes in an article. The proposed open-IE approach is based on distributional semantic similarity over the GO terms. The method does not require annotated data for training, which makes it highly generalizable. We achieve an F-measure of 0.26 on the test-set in the official submission for BioCreative-GO shared task, the third highest F-measure among the seven participants in the shared task.
Database URL: https://code.google.com/p/rainbow-nlp/
Gene Ontology (GO) annotation is a common task among model organism databases (MODs) for capturing gene function data from journal articles. It is a time-consuming and labor-intensive task, and is thus often considered as one of the bottlenecks in literature curation. There is a growing need for semiautomated or fully automated GO curation techniques that will help database curators to rapidly and accurately identify gene function information in full-length articles. Despite multiple attempts in the past, few studies have proven to be useful with regard to assisting real-world GO curation. The shortage of sentence-level training data and opportunities for interaction between text-mining developers and GO curators has limited the advances in algorithm development and corresponding use in practical circumstances. To this end, we organized a text-mining challenge task for literature-based GO annotation in BioCreative IV. More specifically, we developed two subtasks: (i) to automatically locate text passages that contain GO-relevant information (a text retrieval task) and (ii) to automatically identify relevant GO terms for the genes in a given article (a concept-recognition task). With the support from five MODs, we provided teams with >4000 unique text passages that served as the basis for each GO annotation in our task data. Such evidence text information has long been recognized as critical for text-mining algorithm development but was never made available because of the high cost of curation. In total, seven teams participated in the challenge task. From the team results, we conclude that the state of the art in automatically mining GO terms from literature has improved over the past decade while much progress is still needed for computer-assisted GO curation. 
Future work should focus on addressing remaining technical challenges for improved performance of automatic GO concept recognition and incorporating practical benefits of text-mining tools into real-world GO annotation.
Women with Lynch syndrome have a 40–60% lifetime risk for developing endometrial cancer, a cancer associated with estrogen imbalance. The molecular basis for endometrial-specific tumorigenesis is unclear. Progestins inhibit estrogen-driven proliferation, and epidemiologic studies have demonstrated that progestin-containing oral contraceptives (OCP) reduce the risk of endometrial cancer by 50% in women at general population risk. It is unknown if they are effective in women with Lynch syndrome. Asymptomatic women age 25–50 with Lynch syndrome were randomized to receive the progestin compounds depo-Provera (depoMPA) or OCP for three months. An endometrial biopsy and transvaginal ultrasound were performed before and after treatment. Endometrial proliferation was evaluated as the primary endpoint. Histology and a panel of surrogate endpoint biomarkers were evaluated for each endometrial biopsy as secondary endpoints. A total of 51 women were enrolled, and 46 completed treatment. Two of the 51 women had complex hyperplasia with atypia at the baseline endometrial biopsy and were excluded from the study. Overall, both depoMPA and OCP induced a dramatic decrease in endometrial epithelial proliferation and microscopic changes in the endometrium characteristic of progestin action. Transvaginal ultrasound measurement of endometrial stripe was not a useful measure of endometrial response or baseline hyperplasia. These results demonstrate that women with Lynch syndrome do show an endometrial response to short term exogenous progestins, suggesting that OCP and depoMPA may be reasonable chemopreventive agents in this high-risk patient population.
endometrial cancer; chemoprevention; Lynch syndrome; progestin
To investigate the dosimetric impact of the heterogeneity dose calculation Acuros, a grid-based Boltzmann equation solver (GBBS), for brachytherapy in a cohort of cervical cancer patients.
Methods and Materials
The impact of heterogeneities was retrospectively assessed in treatment plans for 26 patients who had previously received 192Ir intracavitary brachytherapy for cervical cancer with computed tomography (CT)/magnetic resonance (MR)-compatible tandems and unshielded colpostats. The GBBS models sources, patient boundaries, applicators, and tissue heterogeneities. Multiple GBBS calculations were performed: with and without solid model applicator, with and without overriding the patient contour to 1 g/cc muscle, and with and without overriding contrast materials to muscle or 2.25 g/cc bone. Impact of source and boundary modeling, applicator, tissue heterogeneities, and sensitivity of CT-to-material mapping of contrast were derived from the multiple calculations. TG-43 and the GBBS were compared for the following clinical dosimetric parameters: Manchester points A and B, ICRU report #38 rectal and bladder points, three and nine o'clock, and D2cc to the bladder, rectum, and sigmoid.
Points A, B, D2cc bladder, ICRU bladder, and three and nine o'clock were within 5% of TG-43 for all GBBS calculations. The source and boundary and applicator account for most of the differences between the GBBS and TG-43. The D2cc rectum (n=3), D2cc sigmoid (n=1), and ICRU rectum (n=6) had differences > 5% from TG-43 for the worst case incorrect mapping of contrast to bone. Clinical dosimetric parameters were within 5% of TG-43 when rectal and balloon contrast mapped to bone and radiopaque packing was not overridden.
The GBBS has minimal impact on clinical parameters for this cohort of GYN patients with unshielded applicators. The incorrect mapping of rectal and balloon contrast does not have a significant impact on clinical parameters. Rectal parameters may be sensitive to the mapping of radiopaque packing.
brachytherapy; intracavitary; 192Ir; grid-based Boltzmann solver; TG-43
To determine factors which may increase the likelihood of adverse drug events (ADEs) in recurrent endometrial cancer patients treated with pegylated liposomal doxorubicin (PLD) as well as this agent’s impact on clinical outcomes.
The treatment records of endometrial cancer patients who received PLD at The University of Texas, M.D. Anderson Cancer Center from 1996 to 2006 were reviewed. Patient demographics, PLD dose, ADEs, use of supportive care interventions, disease progression and survival were extracted. Logistic regression analysis was used to identify factors which were associated with higher incidence of ADEs and which influenced survival.
A total of 60 recurrent endometrial cancer patients were identified who experienced 122 ADEs. The most commonly reported ADEs were nausea (18.9%), palmar-plantar erythrodysesthesia (PPE) (16.4%), muscle weakness (12.3%), mucositis (10.7%), and peripheral neuropathy (9.8%). Seventeen patients (28%) required a dose reduction due to ADEs. However, only five (8.3%) patients discontinued therapy because of toxicity. Cooling mechanisms were used in 19 patients to prevent PPE, although nine of these patients still experienced PPE. Treatment with six or more cycles of PLD was associated with increased incidence of neutropenia (p=0.045), peripheral neuropathy (p=0.004), and PPE (p<0.001). No differences in PFS or TTP were found between the doses of PLD; however, there was an assessable trend toward increased survival with doses of 40 mg/m2.
While there was no association between dose level and ADEs, receiving more cycles increased the incidence of toxicities, including PPE and neuropathy. There was no association between different doses of PLD and PFS or TTP.
Doxil; endometrial cancer; adverse effects; dose intensity
Resection of certain recurrent malignancies can prolong survival, but resection of recurrent pancreatic ductal adenocarcinoma is typically contraindicated because of poor outcomes.
All patients from 1992 to 2010 with recurrent pancreatic cancer after intended surgical cure were retrospectively evaluated. Clinicopathologic features were compared from patients who did and did not undergo subsequent reoperation with curative intent to identify factors associated with prolonged survival.
Twenty-one of 426 patients (5%) with recurrent pancreatic cancer underwent potentially curative reoperation for solitary local-regional (n=7) or distant (n=14) recurrence. The median disease-free interval after initial resection among reoperative patients was longer for those with lung or local-regional recurrence (52.4 and 41.1 months, respectively) than for those with liver recurrence (7.6 months, p=0.006). The median interval between reoperation and second recurrence was longer in patients with lung recurrence (median not reached) than with liver or local-regional recurrence (6 and 9 months, respectively, p=0.023). Reoperative patients with an initial disease-free interval >20 months had a longer median survival than those who did not (92.3 versus 31.3 months, respectively; p=0.033).
Patients with a solitary pulmonary recurrence of pancreatic cancer after a prolonged disease-free interval should be considered for reoperation, as they are more likely to benefit from resection versus other sites of solitary recurrence.
Pancreatic ductal adenocarcinoma; Metastasectomy; Reoperation; Locoregional recurrence
Because of privacy concerns and the expense involved in creating an annotated corpus, the existing small annotated corpora might not have sufficient examples for learning to statistically extract all the named entities precisely. In this work, we evaluate what value may lie in automatically generated features based on distributional semantics when using machine-learning named entity recognition (NER). The features we generated and experimented with include n-nearest words, support vector machine (SVM)-regions, and term clustering, all of which are considered distributional semantic features. The addition of the n-nearest words feature to a baseline system resulted in a greater increase in F-score than the addition of a manually constructed lexicon. Although the need for relatively small annotated corpora for retraining is not obviated, lexicons empirically derived from unannotated text can not only supplement manually created lexicons, but also replace them. This phenomenon is observed in extracting concepts from both biomedical literature and clinical notes.
natural language processing; distributional semantics; concept extraction; named entity recognition; empirical lexical resources
No reliable methods currently exist to predict patient response to intravesical immunotherapy with bacillus Calmette-Guérin (BCG), given after transurethral resection for high-risk non-muscle-invasive bladder cancer. We initiated a prospective clinical trial to determine whether fluorescence in situ hybridization (FISH) results during BCG immunotherapy can predict therapy failure.
Materials and Methods
Candidates for standard of care BCG were offered participation in a clinical trial. FISH was performed prior to BCG and at 6 weeks, 3 months, and 6 months during BCG therapy with maintenance. Cox proportional hazards regression was used to assess the relationship between FISH results and tumor recurrence or progression; the Kaplan-Meier product limit method was used to estimate recurrence- and progression-free survival.
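The Kaplan-Meier product limit method named above multiplies, at each observed event time, the fraction of at-risk patients who remain event-free. A minimal sketch follows; the function name `kaplan_meier` and the four follow-up records are hypothetical and are not data from this trial.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate.

    times  -- follow-up time for each patient
    events -- 1 if the event (e.g. recurrence) was observed, 0 if censored
    Returns a list of (time, survival probability) at each distinct event time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        # Group all patients whose follow-up ends at the same time t.
        while i < len(data) and data[i][0] == t:
            n_events += int(data[i][1])
            n_leaving += 1
            i += 1
        if n_events:
            survival *= 1 - n_events / n_at_risk
            curve.append((t, survival))
        # Both events and censored patients leave the risk set after t.
        n_at_risk -= n_leaving
    return curve


# Hypothetical follow-up in months; the second patient is censored.
print(kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1]))
# prints [(1, 0.75), (3, 0.375), (4, 0.0)]
```

Censored patients reduce the number at risk without lowering the curve, which is what distinguishes this estimate from a simple proportion of event-free patients.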
One hundred twenty-six patients participated. At a median follow-up of 24 months, 31% of patients had recurrent tumors and 14% experienced disease progression. Patients who had positive FISH results during BCG therapy were 3-5 times more likely than those who had negative FISH results to develop recurrent tumors and 5-13 times more likely to experience disease progression (p < 0.01). The timing of positive FISH results also affected outcome; for example, patients with a negative FISH result at baseline, 6 weeks, and 3 months demonstrated an 8.3% recurrence rate, compared to 48.1% in those with a positive FISH result at all three time points.
FISH results can identify patients who are at risk of tumor recurrence and progression during BCG immunotherapy. This information may be used to counsel patients about alternative treatment strategies.
bladder cancer; BCG; FISH; response; prediction
Extracting concepts (such as drugs, symptoms, and diagnoses) from clinical narratives constitutes a basic enabling technology to unlock the knowledge within and support more advanced reasoning applications such as diagnosis explanation, disease progression modeling, and intelligent analysis of the effectiveness of treatment. The recent release of annotated training sets of de-identified clinical narratives has contributed to the development and refinement of concept extraction methods. However, as the annotation process is labor-intensive, training data are necessarily limited in the concepts and concept patterns covered, which impacts the performance of supervised machine learning applications trained with these data. This paper proposes an approach to minimize this limitation by combining supervised machine learning with empirical learning of semantic relatedness from the distribution of the relevant words in additional unannotated text.
The approach uses a sequential discriminative classifier (Conditional Random Fields) to extract the mentions of medical problems, treatments and tests from clinical narratives. It takes advantage of all Medline abstracts indexed as being of the publication type “clinical trials” to estimate the relatedness between words in the i2b2/VA training and testing corpora. In addition to the traditional features such as dictionary matching, pattern matching and part-of-speech tags, we also used as a feature words that appear in similar contexts to the word in question (that is, words that have a similar vector representation measured with the commonly used cosine metric, where vector representations are derived using methods of distributional semantics). To the best of our knowledge, this is the first effort exploring the use of distributional semantics, the semantics derived empirically from unannotated text often using vector space models, for a sequence classification task such as concept extraction. Therefore, we first experimented with different sliding window models and found the model with parameters that led to best performance in a preliminary sequence labeling task.
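The feature described above, words that appear in similar contexts as measured by the cosine metric, can be sketched as a nearest-neighbour lookup over word vectors. This is an illustrative sketch only: the three-dimensional vectors and the function names `cosine` and `nearest_words` are hypothetical, whereas the approach itself derives its vectors from Medline abstracts.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest_words(target, vectors, n=2):
    """Return the n words whose vectors are most similar to the target's."""
    scored = [
        (word, cosine(vectors[target], vec))
        for word, vec in vectors.items()
        if word != target
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in scored[:n]]


# Toy vectors (hypothetical); real ones would come from a large unannotated corpus.
vectors = {
    "aspirin": [1.0, 0.9, 0.0],
    "ibuprofen": [0.9, 1.0, 0.1],
    "fever": [0.0, 0.1, 1.0],
}
print(nearest_words("aspirin", vectors, n=1))  # prints ['ibuprofen']
```

The returned neighbours could then be attached to each token as additional features for the sequence classifier.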
The evaluation of this approach, performed against the i2b2/VA concept extraction corpus, showed that incorporating features based on the distribution of words across a large unannotated corpus significantly aids concept extraction. Compared to a supervised-only approach as a baseline, the micro-averaged f-measure for exact match increased from 80.3% to 82.3% and the micro-averaged f-measure based on inexact match increased from 89.7% to 91.3%. These improvements are highly significant according to the bootstrap resampling method and also considering the performance of other systems. Thus, distributional semantic features significantly improve the performance of concept extraction from clinical narratives by taking advantage of word distribution information obtained from unannotated data.
NLP; Information extraction; NER; Distributional Semantics; Clinical Informatics
Phylogeography is a field that focuses on the geographical lineages of species such as vertebrates or viruses. Here, geographical data, such as the location of a species or viral host, is as important as the sequence information extracted from the species. Together, this information can help illustrate the migration of the species over time within a geographical area, the impact of geography over the evolutionary history, or the expected population of the species within the area. Molecular sequence data from NCBI, specifically GenBank, provide an abundance of available sequence data for phylogeography. However, geographical data is inconsistently represented and sparse across GenBank entries. This can impede analysis and, in situations where the geographical information is inferred, potentially lead to erroneous results. In this paper, we describe the current state of geographical data in GenBank, and illustrate how automated processing techniques, such as named entity recognition, can enhance the geographical data available for phylogeographic studies.
Phylogeography; Databases; Nucleic Acid; Geographic Locations; Bioinformatics
Determining usefulness of biomedical text mining systems requires realistic task definition and data selection criteria without artificial constraints, measuring performance aspects that go beyond traditional metrics. The BioCreative III Protein-Protein Interaction (PPI) tasks were motivated by such considerations, trying to address aspects including how the end user would oversee the generated output, for instance by providing ranked results, textual evidence for human interpretation or measuring time savings by using automated systems. Detecting articles describing complex biological events like PPIs was addressed in the Article Classification Task (ACT), where participants were asked to implement tools for detecting PPI-describing abstracts. Therefore the BCIII-ACT corpus was provided, which includes a training, development and test set of over 12,000 PPI relevant and non-relevant PubMed abstracts labeled manually by domain experts and recording also the human classification times. The Interaction Method Task (IMT) went beyond abstracts and required mining for associations between more than 3,500 full text articles and interaction detection method ontology concepts that had been applied to detect the PPIs reported in them.
A total of 11 teams participated in at least one of the two PPI tasks (10 in ACT and 8 in the IMT) and a total of 62 persons were involved either as participants or in preparing data sets/evaluating these tasks. Per task, each team was allowed to submit five runs offline and another five online via the BioCreative Meta-Server. From the 52 runs submitted for the ACT, the highest Matthews correlation coefficient (MCC) score measured was 0.55 at an accuracy of 89%, and the best AUC iP/R was 68%. Most ACT teams explored machine learning methods; some also used lexical resources like MeSH terms, PSI-MI concepts or particular lists of verbs and nouns, and some integrated NER approaches. For the IMT, a total of 42 runs were evaluated by comparing systems against manually generated annotations done by curators from the BioGRID and MINT databases. The highest AUC iP/R achieved by any run was 53%, and the best MCC score was 0.55. For competitive systems with an acceptable recall (above 35%), the macro-averaged precision ranged between 50% and 80%, with a maximum F-Score of 55%.
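The MCC is computed from the four cells of a binary confusion matrix and, unlike accuracy, stays modest when a classifier mostly succeeds on the majority class. A minimal sketch follows; the function name `mcc` and the counts are hypothetical and are not taken from the BCIII-ACT results.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient for a binary confusion matrix."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # common convention when any marginal total is zero
    return (tp * tn - fp * fn) / denominator


# Hypothetical unbalanced collection: 150 relevant and 850 non-relevant abstracts.
tp, fn, fp, tn = 50, 100, 10, 840
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(f"accuracy = {accuracy:.2f}, MCC = {mcc(tp, tn, fp, fn):.2f}")
```

With these illustrative counts the accuracy is high while the MCC is much lower, which is the reason a correlation-based score is informative for unbalanced article collections.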
The results of the ACT task of BioCreative III indicate that classification of large unbalanced article collections reflecting the real class imbalance is still challenging. Nevertheless, text-mining tools that report ranked lists of relevant articles for manual selection can potentially reduce the time needed to identify half of the relevant articles to less than 1/4 of the time when compared to unranked results. Detecting associations between full text articles and interaction detection method PSI-MI terms (IMT) is more difficult than might be anticipated. This is due to the variability of method term mentions, errors resulting from pre-processing of articles provided as PDF files, and the heterogeneity and different granularity of method term concepts encountered in the ontology. However, combining the sophisticated techniques developed by the participants with supporting evidence strings derived from the articles for human interpretation could result in practical modules for biological annotation workflows.
Summary: Identifying mentions of named entities, such as genes or diseases, and normalizing them to database identifiers have become an important step in many text and data mining pipelines. Despite this need, very few entity normalization systems are publicly available as source code or web services for biomedical text mining. Here we present the Gnat Java library for text retrieval, named entity recognition, and normalization of gene and protein mentions in biomedical text. The library can be used as a component to be integrated with other text-mining systems, as a framework to add user-specific extensions, and as an efficient stand-alone application for the identification of gene and protein names for data analysis. On the BioCreative III test data, the current version of Gnat achieves a Tap-20 score of 0.1987.
Availability: The library and web services are implemented in Java and the sources are available from http://gnat.sourceforge.net.
Rapid growth of online health social networks has enabled patients to communicate more easily with each other. This exchange of opinions and experiences has provided a rich source of information about drugs and their effectiveness and, more importantly, their possible adverse reactions. We developed a system to automatically extract mentions of Adverse Drug Reactions (ADRs) from user reviews about drugs in social network websites by mining a set of language patterns. The system applied association rule mining on a set of annotated comments to extract the underlying patterns of colloquial expressions about adverse effects. The patterns were tested on a set of unseen comments to evaluate their performance. We reached a precision of 70.01%, a recall of 66.32%, and an F-measure of 67.96%.
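Association rule mining scores a candidate pattern by its support (how often it occurs) and its confidence (how often the consequent follows the antecedent). The sketch below is illustrative only and is not the system's implementation; the function name `rule_stats`, the item labels, and the four toy comments are hypothetical.

```python
def rule_stats(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    antecedent = set(antecedent)
    both = antecedent | set(consequent)
    n_antecedent = sum(1 for t in transactions if antecedent <= t)
    n_both = sum(1 for t in transactions if both <= t)
    support = n_both / len(transactions)
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Each toy comment is reduced to a set of items: cue words plus an annotation label.
comments = [
    {"made", "me", "ADR"},
    {"made", "me", "ADR"},
    {"made", "me", "OTHER"},
    {"helped", "OTHER"},
]
support, confidence = rule_stats(comments, {"made", "me"}, {"ADR"})
print(f"support = {support:.2f}, confidence = {confidence:.2f}")
# prints support = 0.50, confidence = 0.67
```

Rules whose support and confidence exceed chosen thresholds would be kept as extraction patterns; the thresholds themselves are not stated in the text.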
BioSimplify is an open source tool written in Java that introduces and facilitates the use of a novel model for sentence simplification tuned for automatic discourse analysis and information extraction (as opposed to sentence simplification for improving human readability). The model is based on a “shot-gun” approach that produces many different (simpler) versions of the original sentence by combining variants of its constituent elements. This tool is optimized for processing biomedical scientific literature such as the abstracts indexed in PubMed. We tested the tool's impact on the task of PPI extraction: it improved the f-score of the PPI tool by around 7%, with an improvement in recall of around 20%. The BioSimplify tool and test corpus can be downloaded from https://biosimplify.sourceforge.net
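The “shot-gun” idea of producing many simpler versions of a sentence by combining variants of its constituent elements can be illustrated with a Cartesian product. This sketch is not BioSimplify's implementation (which is written in Java); the function name `simplified_versions` and the hand-segmented example sentence are hypothetical.

```python
from itertools import product


def simplified_versions(constituent_variants):
    """Combine one variant per constituent into every possible sentence."""
    sentences = []
    for parts in product(*constituent_variants):
        sentences.append(" ".join(parts) + ".")
    return sentences


# Each inner list holds alternative renderings of one constituent,
# from the simplest form to the original wording.
variants = [
    ["BRCA1", "BRCA1, a tumor suppressor,"],
    ["binds"],
    ["BARD1", "BARD1 in the nucleus"],
]
for sentence in simplified_versions(variants):
    print(sentence)
```

A downstream extraction tool can then be run on every generated version, so that a relation missed in the original wording may still be found in a simpler one.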
To evaluate the performance of the Human Papillomavirus High-Risk DNA test in patients 30 years and older.
Materials and Methods
Screening (N=835) and diagnosis (N=518) groups were defined based on prior Papanicolaou smear results as part of a clinical trial for cervical cancer detection. We compared the Hybrid Capture II® (HCII) test result to the worst histological report. We used cervical intraepithelial neoplasia (CIN) 2/3 or worse as the reference of disease. We calculated sensitivities, specificities, positive and negative likelihood ratios (LR+ and LR−), receiver operating characteristic (ROC) curves, and areas under the ROC curves for the HCII test. We also considered alternative strategies, including Papanicolaou smear, a combination of Papanicolaou smear and the HCII test, a sequence of Papanicolaou smear followed by the HCII test, and a sequence of the HCII test followed by Papanicolaou smear.
For the screening group, the sensitivity was 0.69 and the specificity was 0.93; the area under the ROC curve was 0.81. The LR+ and LR− were 10.24 and 0.34, respectively. For the diagnosis group, the sensitivity was 0.88 and the specificity was 0.78; the area under the ROC curve was 0.83. The LR+ and LR− were 4.06 and 0.14, respectively. Sequential testing showed little or no improvement over the combination testing.
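Positive and negative likelihood ratios follow directly from sensitivity and specificity: LR+ = sensitivity / (1 − specificity) and LR− = (1 − sensitivity) / specificity. A minimal sketch follows; the function name `likelihood_ratios` is illustrative. Because the inputs are the rounded values quoted above, the outputs are close to, but not identical with, the reported ratios.

```python
def likelihood_ratios(sensitivity: float, specificity: float):
    """Return (LR+, LR-) for a binary diagnostic test."""
    if not (0 < specificity < 1):
        raise ValueError("specificity must lie strictly between 0 and 1")
    lr_positive = sensitivity / (1 - specificity)
    lr_negative = (1 - sensitivity) / specificity
    return lr_positive, lr_negative


# Rounded sensitivity and specificity as quoted for each group.
for group, sens, spec in [("screening", 0.69, 0.93), ("diagnosis", 0.88, 0.78)]:
    lr_pos, lr_neg = likelihood_ratios(sens, spec)
    print(f"{group}: LR+ = {lr_pos:.2f}, LR- = {lr_neg:.2f}")
```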
The HCII test in the screening group had a greater LR+ for the detection of CIN 2/3 or worse. HCII testing may be an additional screening tool for cervical cancer in women 30 years and older.
cervical intraepithelial neoplasia; cervix neoplasms; DNA probes HPV; sensitivity and specificity
With the overwhelming volume of genomic and molecular information available on many databases nowadays, researchers need from bioinformaticians more than encouragement to refine their searches. We present here GeneRanker, an online system that allows researchers to obtain a ranked list of genes potentially related to a specific disease or biological process by combining gene-disease (or gene-biological process) associations with protein-protein interactions extracted from the literature, using computational analysis of the protein network topology to more accurately rank the predicted associations. GeneRanker was evaluated in the context of brain cancer research, and is freely available online at http://www.generanker.org.