Search tips
Search criteria

Results 1-25 (836980)

Clipboard (0)

Related Articles

1.  Challenges for automatically extracting molecular interactions from full-text articles 
BMC Bioinformatics  2009;10:311.
The increasing availability of full-text biomedical articles will allow more biomedical knowledge to be extracted automatically with greater reliability. However, most Information Retrieval (IR) and Extraction (IE) tools currently process only abstracts. The lack of corpora has limited the development of tools that are capable of exploiting the knowledge in full-text articles. As a result, there has been little investigation into the advantages of full-text document structure, and the challenges developers will face in processing full-text articles.
We manually annotated passages from full-text articles that describe interactions summarised in a Molecular Interaction Map (MIM). Our corpus tracks the process of identifying facts to form the MIM summaries and captures any factual dependencies that must be resolved to extract the fact completely. For example, a fact in the results section may require a synonym defined in the introduction. The passages are also annotated with negated and coreference expressions that must be resolved.
We describe the guidelines for identifying relevant passages and possible dependencies. The corpus includes 2162 sentences from 78 full-text articles. Our corpus analysis demonstrates the necessity of full-text processing; identifies the article sections where interactions are most commonly stated; and quantifies the proportion of interaction statements requiring coherent dependencies. Further, it allows us to report on the relative importance of identifying synonyms and resolving negated expressions. We also experiment with an oracle sentence retrieval system using the corpus as a gold-standard evaluation set.
We introduce the MIM corpus, a unique resource that maps interaction facts in a MIM to annotated passages within full-text articles. It is an invaluable case study providing guidance to developers of biomedical IR and IE systems, and can be used as a gold-standard evaluation set for full-text IR tasks.
PMCID: PMC2761905  PMID: 19778419
2.  Automatic extraction of semantic relations between medical entities: a rule based approach 
Journal of Biomedical Semantics  2011;2(Suppl 5):S4.
Information extraction is a complex task which is necessary to develop high-precision information retrieval tools. In this paper, we present the platform MeTAE (Medical Texts Annotation and Exploration). MeTAE allows (i) to extract and annotate medical entities and relationships from medical texts and (ii) to explore semantically the produced RDF annotations.
Our annotation approach relies on linguistic patterns and domain knowledge and consists in two steps: (i) recognition of medical entities and (ii) identification of the correct semantic relation between each pair of entities. The first step is achieved by an enhanced use of MetaMap which improves the precision obtained by MetaMap by 19.59% in our evaluation. The second step relies on linguistic patterns which are built semi-automatically from a corpus selected according to semantic criteria. We evaluate our system’s ability to identify medical entities of 16 types. We also evaluate the extraction of treatment relations between a treatment (e.g. medication) and a problem (e.g. disease): we obtain 75.72% precision and 60.46% recall.
According to our experiments, using an external sentence segmenter and noun phrase chunker may improve the precision of MetaMap-based medical entity recognition. Our pattern-based relation extraction method obtains good precision and recall w.r.t related works. A more precise comparison with related approaches remains difficult however given the differences in corpora and in the exact nature of the extracted relations. The selection of MEDLINE articles through queries related to known drug-disease pairs enabled us to obtain a more focused corpus of relevant examples of treatment relations than a more general MEDLINE query.
PMCID: PMC3239304  PMID: 22166723
3.  Resolving anaphoras for the extraction of drug-drug interactions in pharmacological documents 
BMC Bioinformatics  2010;11(Suppl 2):S1.
Drug-drug interactions are frequently reported in the increasing amount of biomedical literature. Information Extraction (IE) techniques have been devised as a useful instrument to manage this knowledge. Nevertheless, IE at the sentence level has a limited effect because of the frequent references to previous entities in the discourse, a phenomenon known as 'anaphora'. DrugNerAR, a drug anaphora resolution system is presented to address the problem of co-referring expressions in pharmacological literature. This development is part of a larger and innovative study about automatic drug-drug interaction extraction.
The system uses a set of linguistic rules drawn by Centering Theory over the analysis provided by a biomedical syntactic parser. Semantic information provided by the Unified Medical Language System (UMLS) is also integrated in order to improve the recognition and the resolution of nominal drug anaphors. Besides, a corpus has been developed in order to analyze the phenomena and evaluate the current approach. Each possible case of anaphoric expression was looked into to determine the most effective way of resolution.
An F-score of 0.76 in anaphora resolution was achieved, outperforming significantly the baseline by almost 73%. This ad-hoc reference line was developed to check the results as there is no previous work on anaphora resolution in pharmalogical documents. The obtained results resemble those found in related-semantic domains.
The present approach shows very promising results in the challenge of accounting for anaphoric expressions in pharmacological texts. DrugNerAr obtains similar results to other approaches dealing with anaphora resolution in the biomedical domain, but, unlike these approaches, it focuses on documents reflecting drug interactions. The Centering Theory has proved being effective at the selection of antecedents in anaphora resolution. A key component in the success of this framework is the analysis provided by the MMTx program and the DrugNer system that allows to deal with the complexity of the pharmacological language. It is expected that the positive results of the resolver increases performance of our future drug-drug interaction extraction system.
PMCID: PMC3288782  PMID: 20406499
4.  Sentence retrieval for abstracts of randomized controlled trials 
The practice of evidence-based medicine (EBM) requires clinicians to integrate their expertise with the latest scientific research. But this is becoming increasingly difficult with the growing numbers of published articles. There is a clear need for better tools to improve clinician's ability to search the primary literature. Randomized clinical trials (RCTs) are the most reliable source of evidence documenting the efficacy of treatment options. This paper describes the retrieval of key sentences from abstracts of RCTs as a step towards helping users find relevant facts about the experimental design of clinical studies.
Using Conditional Random Fields (CRFs), a popular and successful method for natural language processing problems, sentences referring to Intervention, Participants and Outcome Measures are automatically categorized. This is done by extending a previous approach for labeling sentences in an abstract for general categories associated with scientific argumentation or rhetorical roles: Aim, Method, Results and Conclusion. Methods are tested on several corpora of RCT abstracts. First structured abstracts with headings specifically indicating Intervention, Participant and Outcome Measures are used. Also a manually annotated corpus of structured and unstructured abstracts is prepared for testing a classifier that identifies sentences belonging to each category.
Using CRFs, sentences can be labeled for the four rhetorical roles with F-scores from 0.93–0.98. This outperforms the use of Support Vector Machines. Furthermore, sentences can be automatically labeled for Intervention, Participant and Outcome Measures, in unstructured and structured abstracts where the section headings do not specifically indicate these three topics. F-scores of up to 0.83 and 0.84 are obtained for Intervention and Outcome Measure sentences.
Results indicate that some of the methodological elements of RCTs are identifiable at the sentence level in both structured and unstructured abstract reports. This is promising in that sentences labeled automatically could potentially form concise summaries, assist in information retrieval and finer-grained extraction.
PMCID: PMC2657779  PMID: 19208256
5.  Creation of a corpus for evidence based medicine summarisation 
The Australasian Medical Journal  2012;5(9):503-506.
Automated text summarisers that find the best clinical evidence reported in collections of medical literature are of potential benefit for the practice of Evidence Based Medicine (EBM). Research and development of text summarisers for EBM, however, is impeded by the lack of corpora to train and test such systems.
To produce a corpus for research in EBM summarisation.
We sourced the “Clinical Inquiries” section of the Journal of Family Practice (JFP) and obtained a sizeable sample of questions and evidence based summaries. We further processed the summaries by combining automated techniques, human annotations, and crowdsourcing techniques to identify the PubMed IDs of the references.
The corpus has 456 questions, 1,396 answer components, 3,036 answer justifications, and 2,908 references.
The corpus is now available for the research community at
PMCID: PMC3477779  PMID: 23115585
Evidence Based Medicine; corpora; text summarisation; natural language processing.
6.  Event extraction of bacteria biotopes: a knowledge-intensive NLP-based approach 
BMC Bioinformatics  2012;13(Suppl 11):S8.
Bacteria biotopes cover a wide range of diverse habitats including animal and plant hosts, natural, medical and industrial environments. The high volume of publications in the microbiology domain provides a rich source of up-to-date information on bacteria biotopes. This information, as found in scientific articles, is expressed in natural language and is rarely available in a structured format, such as a database. This information is of great importance for fundamental research and microbiology applications (e.g., medicine, agronomy, food, bioenergy). The automatic extraction of this information from texts will provide a great benefit to the field.
We present a new method for extracting relationships between bacteria and their locations using the Alvis framework. Recognition of bacteria and their locations was achieved using a pattern-based approach and domain lexical resources. For the detection of environment locations, we propose a new approach that combines lexical information and the syntactic-semantic analysis of corpus terms to overcome the incompleteness of lexical resources. Bacteria location relations extend over sentence borders, and we developed domain-specific rules for dealing with bacteria anaphors.
We participated in the BioNLP 2011 Bacteria Biotope (BB) task with the Alvis system. Official evaluation results show that it achieves the best performance of participating systems. New developments since then have increased the F-score by 4.1 points.
We have shown that the combination of semantic analysis and domain-adapted resources is both effective and efficient for event information extraction in the bacteria biotope domain. We plan to adapt the method to deal with a larger set of location types and a large-scale scientific article corpus to enable microbiologists to integrate and use the extracted knowledge in combination with experimental data.
PMCID: PMC3384252  PMID: 22759462
7.  Evidence for the Selective Reporting of Analyses and Discrepancies in Clinical Trials: A Systematic Review of Cohort Studies of Clinical Trials 
PLoS Medicine  2014;11(6):e1001666.
In a systematic review of cohort studies, Kerry Dwan and colleagues examine the evidence for selective reporting and discrepancies in analyses between journal publications and other documents for clinical trials.
Please see later in the article for the Editors' Summary
Most publications about selective reporting in clinical trials have focussed on outcomes. However, selective reporting of analyses for a given outcome may also affect the validity of findings. If analyses are selected on the basis of the results, reporting bias may occur. The aims of this study were to review and summarise the evidence from empirical cohort studies that assessed discrepant or selective reporting of analyses in randomised controlled trials (RCTs).
Methods and Findings
A systematic review was conducted and included cohort studies that assessed any aspect of the reporting of analyses of RCTs by comparing different trial documents, e.g., protocol compared to trial report, or different sections within a trial publication. The Cochrane Methodology Register, Medline (Ovid), PsycInfo (Ovid), and PubMed were searched on 5 February 2014. Two authors independently selected studies, performed data extraction, and assessed the methodological quality of the eligible studies. Twenty-two studies (containing 3,140 RCTs) published between 2000 and 2013 were included. Twenty-two studies reported on discrepancies between information given in different sources. Discrepancies were found in statistical analyses (eight studies), composite outcomes (one study), the handling of missing data (three studies), unadjusted versus adjusted analyses (three studies), handling of continuous data (three studies), and subgroup analyses (12 studies). Discrepancy rates varied, ranging from 7% (3/42) to 88% (7/8) in statistical analyses, 46% (36/79) to 82% (23/28) in adjusted versus unadjusted analyses, and 61% (11/18) to 100% (25/25) in subgroup analyses. This review is limited in that none of the included studies investigated the evidence for bias resulting from selective reporting of analyses. It was not possible to combine studies to provide overall summary estimates, and so the results of studies are discussed narratively.
Discrepancies in analyses between publications and other study documentation were common, but reasons for these discrepancies were not discussed in the trial reports. To ensure transparency, protocols and statistical analysis plans need to be published, and investigators should adhere to these or explain discrepancies.
Please see later in the article for the Editors' Summary
Editors' Summary
In the past, clinicians relied on their own experience when choosing the best treatment for their patients. Nowadays, they turn to evidence-based medicine—the systematic review and appraisal of trials, studies that investigate the benefits and harms of medical treatments in patients. However, evidence-based medicine can guide clinicians only if all the results from clinical trials are published in an unbiased and timely manner. Unfortunately, the results of trials in which a new drug performs better than existing drugs are more likely to be published than those in which the new drug performs badly or has unwanted side effects (publication bias). Moreover, trial outcomes that support the use of a new treatment are more likely to be published than those that do not support its use (outcome reporting bias). Recent initiatives—such as making registration of clinical trials in a trial registry (for example, a prerequisite for publication in medical journals—aim to prevent these biases, which pose a threat to informed medical decision-making.
Why Was This Study Done?
Selective reporting of analyses of outcomes may also affect the validity of clinical trial findings. Sometimes, for example, a trial publication will include a per protocol analysis (which considers only the outcomes of patients who received their assigned treatment) rather than a pre-planned intention-to-treat analysis (which considers the outcomes of all the patients regardless of whether they received their assigned treatment). If the decision to publish the per protocol analysis is based on the results of this analysis being more favorable than those of the intention-to-treat analysis (which more closely resembles “real” life), then “analysis reporting bias” has occurred. In this systematic review, the researchers investigate the selective reporting of analyses and discrepancies in randomized controlled trials (RCTs) by reviewing published studies that assessed selective reporting of analyses in groups (cohorts) of RCTs and discrepancies in analyses of RCTs between different sources (for example, between the protocol in a trial registry and the journal publication) or different sections of a source. A systematic review uses predefined criteria to identify all the research on a given topic.
What Did the Researchers Do and Find?
The researchers identified 22 cohort studies (containing 3,140 RCTs) that were eligible for inclusion in their systematic review. All of these studies reported on discrepancies between the information provided by the RCTs in different places, but none investigated the evidence for analysis reporting bias. Several of the cohort studies reported, for example, that there were discrepancies in the statistical analyses included in the different documents associated with the RCTs included in their analysis. Other types of discrepancies reported by the cohort studies included discrepancies in the reporting of composite outcomes (an outcome in which multiple end points are combined) and in the reporting of subgroup analyses (investigations of outcomes in subgroups of patients that should be predefined in the trial protocol to avoid bias). Discrepancy rates varied among the RCTs according to the types of analyses and cohort studies considered. Thus, whereas in one cohort study discrepancies were present in the statistical test used for the analysis of the primary outcome in only 7% of the included studies, they were present in the subgroup analyses of all the included studies.
What Do These Findings Mean?
These findings indicate that discrepancies in analyses between publications and other study documents such as protocols in trial registries are common. The reasons for these discrepancies in analyses were not discussed in trial reports but may be the result of reporting bias, errors, or legitimate departures from a pre-specified protocol. For example, a statistical analysis that is not specified in the trial protocol may sometimes appear in a publication because the journal requested its inclusion as a condition of publication. The researchers suggest that it may be impossible for systematic reviewers to distinguish between these possibilities simply by looking at the source documentation. Instead, they suggest, it may be necessary for reviewers to contact the trial authors. However, to make selective reporting of analyses more easily detectable, they suggest that protocols and analysis plans should be published and that investigators should be required to stick to these plans or explain any discrepancies when they publish their trial results. Together with other initiatives, this approach should help improve the quality of evidence-based medicine and, as a result, the treatment of patients.
Additional Information
Please access these websites via the online version of this summary at
Wikipedia has pages on evidence-based medicine, on systematic reviews, and on publication bias (note: Wikipedia is a free online encyclopedia that anyone can edit; available in several languages) provides information about the US National Institutes of Health clinical trial registry, including background information about clinical trials
The Cochrane Collaboration is a global independent network of health practitioners, researchers, patient advocates, and others that aims to promote evidence-informed health decision-making by producing high-quality, relevant, accessible systematic reviews and other synthesized research evidence; the Cochrane Handbook for Systematic Reviews of Interventions describes the preparation of systematic reviews in detail
PLOS Medicine recently launched a Reporting Guidelines Collection, an open-access collection of reporting guidelines, commentary, and related research on guidelines from across PLOS journals that aims to help advance the efficiency, effectiveness, and equitability of the dissemination of biomedical information
PMCID: PMC4068996  PMID: 24959719
8.  Learning to rank-based gene summary extraction 
BMC Bioinformatics  2014;15(Suppl 12):S10.
In recent years, the biomedical literature has been growing rapidly. These articles provide a large amount of information about proteins, genes and their interactions. Reading such a huge amount of literature is a tedious task for researchers to gain knowledge about a gene. As a result, it is significant for biomedical researchers to have a quick understanding of the query concept by integrating its relevant resources.
In the task of gene summary generation, we regard automatic summary as a ranking problem and apply the method of learning to rank to automatically solve this problem. This paper uses three features as a basis for sentence selection: gene ontology relevance, topic relevance and TextRank. From there, we obtain the feature weight vector using the learning to rank algorithm and predict the scores of candidate summary sentences and obtain top sentences to generate the summary.
ROUGE (a toolkit for summarization of automatic evaluation) was used to evaluate the summarization result and the experimental results showed that our method outperforms the baseline techniques.
According to the experimental result, the combination of three features can improve the performance of summary. The application of learning to rank can facilitate the further expansion of features for measuring the significance of sentences.
PMCID: PMC4243090  PMID: 25474678
9.  MeSH: a window into full text for document summarization 
Bioinformatics  2011;27(13):i120-i128.
Motivation: Previous research in the biomedical text-mining domain has historically been limited to titles, abstracts and metadata available in MEDLINE records. Recent research initiatives such as TREC Genomics and BioCreAtIvE strongly point to the merits of moving beyond abstracts and into the realm of full texts. Full texts are, however, more expensive to process not only in terms of resources needed but also in terms of accuracy. Since full texts contain embellishments that elaborate, contextualize, contrast, supplement, etc., there is greater risk for false positives. Motivated by this, we explore an approach that offers a compromise between the extremes of abstracts and full texts. Specifically, we create reduced versions of full text documents that contain only important portions. In the long-term, our goal is to explore the use of such summaries for functions such as document retrieval and information extraction. Here, we focus on designing summarization strategies. In particular, we explore the use of MeSH terms, manually assigned to documents by trained annotators, as clues to select important text segments from the full text documents.
Results: Our experiments confirm the ability of our approach to pick the important text portions. Using the ROUGE measures for evaluation, we were able to achieve maximum ROUGE-1, ROUGE-2 and ROUGE-SU4 F-scores of 0.4150, 0.1435 and 0.1782, respectively, for our MeSH term-based method versus the maximum baseline scores of 0.3815, 0.1353 and 0.1428, respectively. Using a MeSH profile-based strategy, we were able to achieve maximum ROUGE F-scores of 0.4320, 0.1497 and 0.1887, respectively. Human evaluation of the baselines and our proposed strategies further corroborates the ability of our method to select important sentences from the full texts.
PMCID: PMC3117369  PMID: 21685060
10.  Textpresso: An Ontology-Based Information Retrieval and Extraction System for Biological Literature 
PLoS Biology  2004;2(11):e309.
We have developed Textpresso, a new text-mining system for scientific literature whose capabilities go far beyond those of a simple keyword search engine. Textpresso's two major elements are a collection of the full text of scientific articles split into individual sentences, and the implementation of categories of terms for which a database of articles and individual sentences can be searched. The categories are classes of biological concepts (e.g., gene, allele, cell or cell group, phenotype, etc.) and classes that relate two objects (e.g., association, regulation, etc.) or describe one (e.g., biological process, etc.). Together they form a catalog of types of objects and concepts called an ontology. After this ontology is populated with terms, the whole corpus of articles and abstracts is marked up to identify terms of these categories. The current ontology comprises 33 categories of terms. A search engine enables the user to search for one or a combination of these tags and/or keywords within a sentence or document, and as the ontology allows word meaning to be queried, it is possible to formulate semantic queries. Full text access increases recall of biological data types from 45% to 95%. Extraction of particular biological facts, such as gene-gene interactions, can be accelerated significantly by ontologies, with Textpresso automatically performing nearly as well as expert curators to identify sentences; in searches for two uniquely named genes and an interaction term, the ontology confers a 3-fold increase of search efficiency. Textpresso currently focuses on Caenorhabditis elegans literature, with 3,800 full text articles and 16,000 abstracts. The lexicon of the ontology contains 14,500 entries, each of which includes all versions of a specific word or phrase, and it includes all categories of the Gene Ontology database. Textpresso is a useful curation tool, as well as search engine for researchers, and can readily be extended to other organism-specific corpora of text. Textpresso can be accessed at or via WormBase at
With the increasing availability of full-text scientific papers online, new tools, such as Textpresso, will help to extract information and knowledge from research literature
PMCID: PMC517822  PMID: 15383839
11.  Automatic extraction of candidate nomenclature terms using the doublet method 
New terminology continuously enters the biomedical literature. How can curators identify new terms that can be added to existing nomenclatures? The most direct method, and one that has served well, involves reading the current literature. The scholarly curator adds new terms as they are encountered. Present-day scholars are severely challenged by the enormous volume of biomedical literature. Curators of medical nomenclatures need computational assistance if they hope to keep their terminologies current. The purpose of this paper is to describe a method of rapidly extracting new, candidate terms from huge volumes of biomedical text. The resulting lists of terms can be quickly reviewed by curators and added to nomenclatures, if appropriate. The candidate term extractor uses a variation of the previously described doublet coding method. The algorithm, which operates on virtually any nomenclature, derives from the observation that most terms within a knowledge domain are composed entirely of word combinations found in other terms from the same knowledge domain. Terms can be expressed as sequences of overlapping word doublets that have more specific meaning than the individual words that compose the term. The algorithm parses through text, finding contiguous sequences of word doublets that are known to occur somewhere in the reference nomenclature. When a sequence of matching word doublets is encountered, it is compared with whole terms already included in the nomenclature. If the doublet sequence is not already in the nomenclature, it is extracted as a candidate new term. Candidate new terms can be reviewed by a curator to determine if they should be added to the nomenclature. An implementation of the algorithm is demonstrated, using a corpus of published abstracts obtained through the National Library of Medicine's PubMed query service and using "The developmental lineage classification and taxonomy of neoplasms" as a reference nomenclature.
A 31+ Megabyte corpus of pathology journal abstracts was parsed using the doublet extraction method. This corpus consisted of 4,289 records, each containing an abstract title. The total number of words included in the abstract titles was 50,547. New candidate terms for the nomenclature were automatically extracted from the titles of abstracts in the corpus. Total execution time on a desktop computer with CPU speed of 2.79 GHz was 2 seconds. The resulting output consisted of 313 new candidate terms, each consisting of concatenated doublets found in the reference nomenclature. Human review of the 313 candidate terms yielded a list of 285 terms approved by a curator. A final automatic extraction of duplicate terms yielded a final list of 222 new terms (71% of the original 313 extracted candidate terms) that could be added to the reference nomenclature.
The doublet method for automatically extracting candidate nomenclature terms can be used to quickly find new terms from vast amounts of text. The method can be immediately adapted for virtually any text and any nomenclature. An implementation of the algorithm, in the Perl programming language, is provided with this article.
PMCID: PMC1274323  PMID: 16232314
12.  A semantic-based method for extracting concept definitions from scientific publications: evaluation in the autism phenotype domain 
A variety of informatics approaches have been developed that use information retrieval, NLP and text-mining techniques to identify biomedical concepts and relations within scientific publications or their sentences. These approaches have not typically addressed the challenge of extracting more complex knowledge such as biomedical definitions. In our efforts to facilitate knowledge acquisition of rule-based definitions of autism phenotypes, we have developed a novel semantic-based text-mining approach that can automatically identify such definitions within text.
Using an existing knowledge base of 156 autism phenotype definitions and an annotated corpus of 26 source articles containing such definitions, we evaluated and compared the average rank of correctly identified rule definition or corresponding rule template using both our semantic-based approach and a standard term-based approach. We examined three separate scenarios: (1) the snippet of text contained a definition already in the knowledge base; (2) the snippet contained an alternative definition for a concept in the knowledge base; and (3) the snippet contained a definition not in the knowledge base. Our semantic-based approach had a higher average rank than the term-based approach for each of the three scenarios (scenario 1: 3.8 vs. 5.0; scenario 2: 2.8 vs. 4.9; and scenario 3: 4.5 vs. 6.2), with each comparison significant at the p-value of 0.05 using the Wilcoxon signed-rank test.
Our work shows that leveraging existing domain knowledge in the information extraction of biomedical definitions significantly improves the correct identification of such knowledge within sentences. Our method can thus help researchers rapidly acquire knowledge about biomedical definitions that are specified and evolving within an ever-growing corpus of scientific publications.
PMCID: PMC3765483  PMID: 23937724
Knowledge acquisition; Ontologies; Rules; Biomedical definitions; Autism phenotypes
13.  A Comprehensive Benchmark of Kernel Methods to Extract Protein–Protein Interactions from Literature 
PLoS Computational Biology  2010;6(7):e1000837.
The most important way of conveying new findings in biomedical research is scientific publication. Extraction of protein–protein interactions (PPIs) reported in scientific publications is one of the core topics of text mining in the life sciences. Recently, a new class of such methods has been proposed - convolution kernels that identify PPIs using deep parses of sentences. However, comparing published results of different PPI extraction methods is impossible due to the use of different evaluation corpora, different evaluation metrics, different tuning procedures, etc. In this paper, we study whether the reported performance metrics are robust across different corpora and learning settings and whether the use of deep parsing actually leads to an increase in extraction quality. Our ultimate goal is to identify the one method that performs best in real-life scenarios, where information extraction is performed on unseen text and not on specifically prepared evaluation data. We performed a comprehensive benchmarking of nine different methods for PPI extraction that use convolution kernels on rich linguistic information. Methods were evaluated on five different public corpora using cross-validation, cross-learning, and cross-corpus evaluation. Our study confirms that kernels using dependency trees generally outperform kernels based on syntax trees. However, our study also shows that only the best kernel methods can compete with a simple rule-based approach when the evaluation prevents information leakage between training and test corpora. Our results further reveal that the F-score of many approaches drops significantly if no corpus-specific parameter optimization is applied and that methods reaching a good AUC score often perform much worse in terms of F-score. We conclude that for most kernels no sensible estimation of PPI extraction performance on new text is possible, given the current heterogeneity in evaluation data. Nevertheless, our study shows that three kernels are clearly superior to the other methods.
Author Summary
The most important way of conveying new findings in biomedical research is scientific publication. In turn, the most recent and most important findings can only be found by carefully reading the scientific literature, which becomes more and more of a problem because of the enormous number of published articles. This situation has led to the development of various computational approaches to the automatic extraction of important facts from articles, mostly concentrating on the recognition of protein names and on interactions between proteins (PPI). However, so far there is little agreement on which methods perform best for which task. Our paper reports on an extensive comparison of nine recent PPI extraction tools. We studied their performance in various settings on a set of five different text collections containing articles describing PPIs, which for the first time allows for an unbiased comparison of their respective effectiveness. Our results show that the tools' performance depends largely on the collection they are trained on and the collection they are then evaluated on, which means that extrapolating their measured performance to arbitrary text is still highly problematic. We also show that certain classes of methods for extracting PPIs are clearly superior to other classes.
PMCID: PMC2895635  PMID: 20617200
14.  Unsupervised grammar induction of clinical report sublanguage 
Journal of Biomedical Semantics  2012;3(Suppl 3):S4.
Clinical reports are written using a subset of natural language while employing many domain-specific terms; such a language is also known as a sublanguage for a scientific or a technical domain. Different genres of clinical reports use different sublaguages, and in addition, different medical facilities use different medical language conventions. This makes supervised training of a parser for clinical sentences very difficult as it would require expensive annotation effort to adapt to every type of clinical text.
In this paper, we present an unsupervised method which automatically induces a grammar and a parser for the sublanguage of a given genre of clinical reports from a corpus with no annotations. In order to capture sentence structures specific to clinical domains, the grammar is induced in terms of semantic classes of clinical terms in addition to part-of-speech tags. Our method induces grammar by minimizing the combined encoding cost of the grammar and the corresponding sentence derivations. The probabilities for the productions of the induced grammar are then learned from the unannotated corpus using an instance of the expectation-maximization algorithm.
Our experiments show that the induced grammar is able to parse novel sentences. Using a dataset of discharge summary sentences with no annotations, our method obtains 60.5% F-measure for parse-bracketing on sentences of maximum length 10. By varying a parameter, the method can induce a range of grammars, from very specific to very general, and obtains the best performance in between the two extremes.
PMCID: PMC3465207  PMID: 23046834
15.  Knowledge-based extraction of adverse drug events from biomedical text 
BMC Bioinformatics  2014;15:64.
Many biomedical relation extraction systems are machine-learning based and have to be trained on large annotated corpora that are expensive and cumbersome to construct. We developed a knowledge-based relation extraction system that requires minimal training data, and applied the system for the extraction of adverse drug events from biomedical text. The system consists of a concept recognition module that identifies drugs and adverse effects in sentences, and a knowledge-base module that establishes whether a relation exists between the recognized concepts. The knowledge base was filled with information from the Unified Medical Language System. The performance of the system was evaluated on the ADE corpus, consisting of 1644 abstracts with manually annotated adverse drug events. Fifty abstracts were used for training, the remaining abstracts were used for testing.
The knowledge-based system obtained an F-score of 50.5%, which was 34.4 percentage points better than the co-occurrence baseline. Increasing the training set to 400 abstracts improved the F-score to 54.3%. When the system was compared with a machine-learning system, jSRE, on a subset of the sentences in the ADE corpus, our knowledge-based system achieved an F-score that is 7 percentage points higher than the F-score of jSRE trained on 50 abstracts, and still 2 percentage points higher than jSRE trained on 90% of the corpus.
A knowledge-based approach can be successfully used to extract adverse drug events from biomedical text without need for a large training set. Whether use of a knowledge base is equally advantageous for other biomedical relation-extraction tasks remains to be investigated.
PMCID: PMC3973995  PMID: 24593054
Relation extraction; Knowledge base; Adverse drug effect
16.  Multi-dimensional classification of biomedical text: Toward automated, practical provision of high-utility text to diverse users 
Bioinformatics  2008;24(18):2086-2093.
Motivation: Much current research in biomedical text mining is concerned with serving biologists by extracting certain information from scientific text. We note that there is no ‘average biologist’ client; different users have distinct needs. For instance, as noted in past evaluation efforts (BioCreative, TREC, KDD) database curators are often interested in sentences showing experimental evidence and methods. Conversely, lab scientists searching for known information about a protein may seek facts, typically stated with high confidence. Text-mining systems can target specific end-users and become more effective, if the system can first identify text regions rich in the type of scientific content that is of interest to the user, retrieve documents that have many such regions, and focus on fact extraction from these regions. Here, we study the ability to characterize and classify such text automatically. We have recently introduced a multi-dimensional categorization and annotation scheme, developed to be applicable to a wide variety of biomedical documents and scientific statements, while intended to support specific biomedical retrieval and extraction tasks.
Results: The annotation scheme was applied to a large corpus in a controlled effort by eight independent annotators, where three individual annotators independently tagged each sentence. We then trained and tested machine learning classifiers to automatically categorize sentence fragments based on the annotation. We discuss here the issues involved in this task, and present an overview of the results. The latter strongly suggest that automatic annotation along most of the dimensions is highly feasible, and that this new framework for scientific sentence categorization is applicable in practice.
PMCID: PMC2530883  PMID: 18718948
17.  Comparison of Machine Learning Algorithms for Classification of the Sentences in Three Clinical Practice Guidelines 
Clinical Practice Guidelines (CPGs) are an effective tool for minimizing the gap between a physician's clinical decision and medical evidence and for modeling the systematic and standardized pathway used to provide better medical treatment to patients.
In this study, sentences within the clinical guidelines are categorized according to a classification system. We used three clinical guidelines that incorporated knowledge from medical experts in the field of family medicine. These were the seventh report of the Joint National Committee (JNC7) on Prevention, Detection, Evaluation, and Treatment of High Blood Pressure from the National Heart, Lung, and Blood Institute; the third report of the National Cholesterol Education Program (NCEP) Expert Panel on Detection, Evaluation, and Treatment of High Blood Cholesterol in Adults from the same institution; and the Standards of Medical Care in Diabetes 2010 report from the American Diabetes Association. Three annotators each tagged 346 sentences hand-chosen from these three clinical guidelines. The three annotators then carried out cross-validations of the tagged corpus. We also used various machine learning-based classifiers for sentence classification.
We conducted experiments using real-valued features and token units, as well as a Boolean feature. The results showed that the combination of maximum entropy-based learning and information gain-based feature extraction gave the best classification performance (over 98% f-measure) in four sentence categories.
This result confirmed the contribution of the feature reduction algorithm and optimal technique for very sparse feature spaces, such as the sentence classification problem in the clinical guideline document.
PMCID: PMC3633167  PMID: 23626914
Knowledge Bases; Data Mining; Information Storage and Retrieval
18.  Evaluating the use of different positional strategies for sentence selection in biomedical literature summarization 
BMC Bioinformatics  2013;14:71.
The position of a sentence in a document has been traditionally considered an indicator of the relevance of the sentence, and therefore it is frequently used by automatic summarization systems as an attribute for sentence selection. Sentences close to the beginning of the document are supposed to deal with the main topic and thus are selected for the summary. This criterion has shown to be very effective when summarizing some types of documents, such as news items. However, this property is not likely to be found in other types of documents, such as scientific articles, where other positional criteria may be preferred. The purpose of the present work is to study the utility of different positional strategies for biomedical literature summarization.
We have evaluated three different positional strategies: (1) awarding the sentences at the beginning of the document, (2) preferring those at the beginning and end of the document, and (3) weighting the sentences according to the section in which they appear. To this end, we have implemented two summarizers, one based on semantic graphs and the other based on concept frequencies, and evaluated the summaries they produce when combined with each of the positional strategies above using ROUGE metrics. Our results indicate that it is possible to improve the quality of the summaries by weighting the sentences according to the section in which they appear (≈17% improvement in ROUGE-2 for the graph-based summarizer and ≈20% for the frequency-based summarizer), and that the sections containing the more salient information are the Methods and Material and the Discussion and Results ones.
It has been found that the use of traditional positional criteria that award sentences at the beginning and/or the end of the document are not helpful when summarizing scientific literature. In contrast, a more appropriate strategy is that which weights sentences according to the section in which they appear.
PMCID: PMC3648362  PMID: 23445074
19.  Automated de-identification of free-text medical records 
Text-based patient medical records are a vital resource in medical research. In order to preserve patient confidentiality, however, the U.S. Health Insurance Portability and Accountability Act (HIPAA) requires that protected health information (PHI) be removed from medical records before they can be disseminated. Manual de-identification of large medical record databases is prohibitively expensive, time-consuming and prone to error, necessitating automatic methods for large-scale, automated de-identification.
We describe an automated Perl-based de-identification software package that is generally usable on most free-text medical records, e.g., nursing notes, discharge summaries, X-ray reports, etc. The software uses lexical look-up tables, regular expressions, and simple heuristics to locate both HIPAA PHI, and an extended PHI set that includes doctors' names and years of dates. To develop the de-identification approach, we assembled a gold standard corpus of re-identified nursing notes with real PHI replaced by realistic surrogate information. This corpus consists of 2,434 nursing notes containing 334,000 words and a total of 1,779 instances of PHI taken from 163 randomly selected patient records. This gold standard corpus was used to refine the algorithm and measure its sensitivity. To test the algorithm on data not used in its development, we constructed a second test corpus of 1,836 nursing notes containing 296,400 words. The algorithm's false negative rate was evaluated using this test corpus.
Performance evaluation of the de-identification software on the development corpus yielded an overall recall of 0.967, precision value of 0.749, and fallout value of approximately 0.002. On the test corpus, a total of 90 instances of false negatives were found, or 27 per 100,000 word count, with an estimated recall of 0.943. Only one full date and one age over 89 were missed. No patient names were missed in either corpus.
We have developed a pattern-matching de-identification system based on dictionary look-ups, regular expressions, and heuristics. Evaluation based on two different sets of nursing notes collected from a U.S. hospital suggests that, in terms of recall, the software out-performs a single human de-identifier (0.81) and performs at least as well as a consensus of two human de-identifiers (0.94). The system is currently tuned to de-identify PHI in nursing notes and discharge summaries but is sufficiently generalized and can be customized to handle text files of any format. Although the accuracy of the algorithm is high, it is probably insufficient to be used to publicly disseminate medical data. The open-source de-identification software and the gold standard re-identified corpus of medical records have therefore been made available to researchers via the PhysioNet website to encourage improvements in the algorithm.
PMCID: PMC2526997  PMID: 18652655
20.  Extracting semantically enriched events from biomedical literature 
BMC Bioinformatics  2012;13:108.
Research into event-based text mining from the biomedical literature has been growing in popularity to facilitate the development of advanced biomedical text mining systems. Such technology permits advanced search, which goes beyond document or sentence-based retrieval. However, existing event-based systems typically ignore additional information within the textual context of events that can determine, amongst other things, whether an event represents a fact, hypothesis, experimental result or analysis of results, whether it describes new or previously reported knowledge, and whether it is speculated or negated. We refer to such contextual information as meta-knowledge. The automatic recognition of such information can permit the training of systems allowing finer-grained searching of events according to the meta-knowledge that is associated with them.
Based on a corpus of 1,000 MEDLINE abstracts, fully manually annotated with both events and associated meta-knowledge, we have constructed a machine learning-based system that automatically assigns meta-knowledge information to events. This system has been integrated into EventMine, a state-of-the-art event extraction system, in order to create a more advanced system (EventMine-MK) that not only extracts events from text automatically, but also assigns five different types of meta-knowledge to these events. The meta-knowledge assignment module of EventMine-MK performs with macro-averaged F-scores in the range of 57-87% on the BioNLP’09 Shared Task corpus. EventMine-MK has been evaluated on the BioNLP’09 Shared Task subtask of detecting negated and speculated events. Our results show that EventMine-MK can outperform other state-of-the-art systems that participated in this task.
We have constructed the first practical system that extracts both events and associated, detailed meta-knowledge information from biomedical literature. The automatically assigned meta-knowledge information can be used to refine search systems, in order to provide an extra search layer beyond entities and assertions, dealing with phenomena such as rhetorical intent, speculations, contradictions and negations. This finer grained search functionality can assist in several important tasks, e.g., database curation (by locating new experimental knowledge) and pathway enrichment (by providing information for inference). To allow easy integration into text mining systems, EventMine-MK is provided as a UIMA component that can be used in the interoperable text mining infrastructure, U-Compare.
PMCID: PMC3464657  PMID: 22621266
21.  Medical Textbook Summarization and Guided Navigation using Statistical Sentence Extraction 
We present a method for automated medical textbook and encyclopedia summarization. Using statistical sentence extraction and semantic relationships, we extract sentences from text returned as part of an existing textbook search (similar to a book index). Our system guides users to the information they desire by summarizing the content of each relevant chapter or section returned through the search. The summary is tailored to contain sentences that specifically address the user’s search terms. Our clustering method selects sentences that contain concepts specifically addressing the context of the query term in each of the returned sections. Our method examines conceptual relationships from the UMLS and selects clusters of concepts using Expectation Maximization (EM). Sentences associated with the concept clusters are shown to the user. We evaluated whether our extracted summary provides a suitable answer to the user’s question.
PMCID: PMC1560740  PMID: 16779153
22.  Enhancing Biomedical Text Summarization Using Semantic Relation Extraction 
PLoS ONE  2011;6(8):e23862.
Automatic text summarization for a biomedical concept can help researchers to get the key points of a certain topic from large amount of biomedical literature efficiently. In this paper, we present a method for generating text summary for a given biomedical concept, e.g., H1N1 disease, from multiple documents based on semantic relation extraction. Our approach includes three stages: 1) We extract semantic relations in each sentence using the semantic knowledge representation tool SemRep. 2) We develop a relation-level retrieval method to select the relations most relevant to each query concept and visualize them in a graphic representation. 3) For relations in the relevant set, we extract informative sentences that can interpret them from the document collection to generate text summary using an information retrieval based method. Our major focus in this work is to investigate the contribution of semantic relation extraction to the task of biomedical text summarization. The experimental results on summarization for a set of diseases show that the introduction of semantic knowledge improves the performance and our results are better than the MEAD system, a well-known tool for text summarization.
PMCID: PMC3162578  PMID: 21887336
23.  ContextD: an algorithm to identify contextual properties of medical terms in a Dutch clinical corpus 
BMC Bioinformatics  2014;15(1):373.
In order to extract meaningful information from electronic medical records, such as signs and symptoms, diagnoses, and treatments, it is important to take into account the contextual properties of the identified information: negation, temporality, and experiencer. Most work on automatic identification of these contextual properties has been done on English clinical text. This study presents ContextD, an adaptation of the English ConText algorithm to the Dutch language, and a Dutch clinical corpus.
We created a Dutch clinical corpus containing four types of anonymized clinical documents: entries from general practitioners, specialists’ letters, radiology reports, and discharge letters. Using a Dutch list of medical terms extracted from the Unified Medical Language System, we identified medical terms in the corpus with exact matching. The identified terms were annotated for negation, temporality, and experiencer properties. To adapt the ConText algorithm, we translated English trigger terms to Dutch and added several general and document specific enhancements, such as negation rules for general practitioners’ entries and a regular expression based temporality module.
The ContextD algorithm utilized 41 unique triggers to identify the contextual properties in the clinical corpus. For the negation property, the algorithm obtained an F-score from 87% to 93% for the different document types. For the experiencer property, the F-score was 99% to 100%. For the historical and hypothetical values of the temporality property, F-scores ranged from 26% to 54% and from 13% to 44%, respectively.
The ContextD showed good performance in identifying negation and experiencer property values across all Dutch clinical document types. Accurate identification of the temporality property proved to be difficult and requires further work. The anonymized and annotated Dutch clinical corpus can serve as a useful resource for further algorithm development.
Electronic supplementary material
The online version of this article (doi:10.1186/s12859-014-0373-3) contains supplementary material, which is available to authorized users.
PMCID: PMC4264258  PMID: 25432799
Contextual features; Negation detection; Dutch electronic medical records
24.  Effectiveness of an e-learning course in evidence-based medicine for foundation (internship) training 
To evaluate the educational effectiveness of a clinically integrated e-learning course for teaching basic evidence-based medicine (EBM) among postgraduate medical trainees compared to a traditional lecture-based course of equivalent content.
We conducted a cluster randomized controlled trial to compare a clinically integrated e-learning EBM course (intervention) to a lecture-based course (control) among postgraduate trainees at foundation or internship level in seven teaching hospitals in the UK West Midlands region. Knowledge gain among participants was measured with a validated instrument using multiple choice questions. Change in knowledge was compared between groups taking into account the cluster design and adjusted for covariates at baseline using generalized estimating equations (GEE) model.
There were seven clusters involving teaching of 237 trainees (122 in the intervention and 115 in the control group). The total number of postgraduate trainees who completed the course was 88 in the intervention group and 72 in the control group. After adjusting for baseline knowledge, there was no difference in the amount of improvement in knowledge of EBM between the two groups. The adjusted post course difference between the intervention group and the control group was only 0.1 scoring points (95% CI −1.2–1.4).
An e-learning course in EBM was as effective in improving knowledge as a standard lecture-based course. The benefits of an e-learning approach need to be considered when planning EBM curricula as it allows standardization of teaching materials and is a potential cost-effective alternative to standard lecture-based teaching.
PMCID: PMC2895523  PMID: 20522698
25.  Syntactic parsing of clinical text: guideline and corpus development with handling ill-formed sentences 
To develop, evaluate, and share: (1) syntactic parsing guidelines for clinical text, with a new approach to handling ill-formed sentences; and (2) a clinical Treebank annotated according to the guidelines. To document the process and findings for readers with similar interest.
Using random samples from a shared natural language processing challenge dataset, we developed a handbook of domain-customized syntactic parsing guidelines based on iterative annotation and adjudication between two institutions. Special considerations were incorporated into the guidelines for handling ill-formed sentences, which are common in clinical text. Intra- and inter-annotator agreement rates were used to evaluate consistency in following the guidelines. Quantitative and qualitative properties of the annotated Treebank, as well as its use to retrain a statistical parser, were reported.
A supplement to the Penn Treebank II guidelines was developed for annotating clinical sentences. After three iterations of annotation and adjudication on 450 sentences, the annotators reached an F-measure agreement rate of 0.930 (while intra-annotator rate was 0.948) on a final independent set. A total of 1100 sentences from progress notes were annotated that demonstrated domain-specific linguistic features. A statistical parser retrained with combined general English (mainly news text) annotations and our annotations achieved an accuracy of 0.811 (higher than models trained purely with either general or clinical sentences alone). Both the guidelines and syntactic annotations are made available at
We developed guidelines for parsing clinical text and annotated a corpus accordingly. The high intra- and inter-annotator agreement rates showed decent consistency in following the guidelines. The corpus was shown to be useful in retraining a statistical parser that achieved moderate accuracy.
PMCID: PMC3822122  PMID: 23907286
natural language processing; syntactic parsing; annotation guidelines; corpus development

Results 1-25 (836980)