Search tips
Search criteria

Results 1-25 (606625)

Clipboard (0)

Related Articles

1.  Finding biomarkers in non-model species: literature mining of transcription factors involved in bovine embryo development 
BioData Mining  2012;5:12.
Since processes in well-known model organisms have specific features different from those in Bos taurus, the organism under study, a good way to describe gene regulation in ruminant embryos would be a species-specific consideration of closely related species to cattle, sheep and pig. However, as highlighted by a recent report, gene dictionaries in pig are smaller than in cattle, bringing a risk to reduce the gene resources to be mined (and so for sheep dictionaries). Bioinformatics approaches that allow an integration of available information on gene function in model organisms, taking into account their specificity, are thus needed. Besides these closely related and biologically relevant species, there is indeed much more knowledge of (i) trophoblast proliferation and differentiation or (ii) embryogenesis in human and mouse species, which provides opportunities for reconstructing proliferation and/or differentiation processes in other mammalian embryos, including ruminants. The necessary knowledge can be obtained partly from (i) stem cell or cancer research to supply useful information on molecular agents or molecular interactions at work in cell proliferation and (ii) mouse embryogenesis to supply useful information on embryo differentiation. However, the total number of publications for all these topics and species is great and their manual processing would be tedious and time consuming. This is why we used text mining for automated text analysis and automated knowledge extraction. To evaluate the quality of this “mining”, we took advantage of studies that reported gene expression profiles during the elongation of bovine embryos and defined a list of transcription factors (or TF, n = 64) that we used as biological “gold standard”. When successful, the “mining” approach would identify them all, as well as novel ones.
To gain knowledge on molecular-genetic regulations in a non model organism, we offer an approach based on literature-mining and score arrangement of data from model organisms. This approach was applied to identify novel transcription factors during bovine blastocyst elongation, a process that is not observed in rodents and primates. As a result, searching through human and mouse corpuses, we identified numerous bovine homologs, among which 11 to 14% of transcription factors including the gold standard TF as well as novel TF potentially important to gene regulation in ruminant embryo development. The scripts of the workflow are written in Perl and available on demand. They require data input coming from all various databases for any kind of biological issue once the data has been prepared according to keywords for the studied topic and species; we can provide data sample to illustrate the use and functionality of the workflow.
To do so, we created a workflow that allowed the pipeline processing of literature data and biological data, extracted from Web of Science (WoS) or PubMed but also from Gene Expression Omnibus (GEO), Gene Ontology (GO), Uniprot, HomoloGene, TcoF-DB and TFe (TF encyclopedia). First, the human and mouse homologs of the bovine proteins were selected, filtered by text corpora and arranged by score functions. The score functions were based on the gene name frequencies in corpora. Then, transcription factors were identified using TcoF-DB and double-checked using TFe to characterise TF groups and families. Thus, among a search space of 18,670 bovine homologs, 489 were identified as transcription factors. Among them, 243 were absent from the high-throughput data available at the time of the study. They thus stand so far for putative TF acting during bovine embryo elongation, but might be retrieved from a recent RNA sequencing dataset (Mamo et al. , 2012). Beyond the 246 TF that appeared expressed in bovine elongating tissues, we restricted our interpretation to those occurring within a list of 50 top-ranked genes. Among the transcription factors identified therein, half belonged to the gold standard (ASCL2, c-FOS, ETS2, GATA3, HAND1) and half did not (ESR1, HES1, ID2, NANOG, PHB2, TP53, STAT3).
A workflow providing search for transcription factors acting in bovine elongation was developed. The model assumed that proteins sharing the same protein domains in closely related species had the same protein functionalities, even if they were differently regulated among species or involved in somewhat different pathways. Under this assumption, we merged the information on different mammalian species from different databases (literature and biology) and proposed 489 TF as potential participants of embryo proliferation and differentiation, with (i) a recall of 95% with regard to a biological gold standard defined in 2011 and (ii) an extension of more than 3 times the gold standard of TF detected so far in elongating tissues. The working capacity of the workflow was supported by the manual expertise of the biologists on the results. The workflow can serve as a new kind of bioinformatics tool to work on fused data sources and can thus be useful in studies of a wide range of biological processes.
PMCID: PMC3563503  PMID: 22931563
2.  Development and tuning of an original search engine for patent libraries in medicinal chemistry 
BMC Bioinformatics  2014;15(Suppl 1):S15.
The large increase in the size of patent collections has led to the need of efficient search strategies. But the development of advanced text-mining applications dedicated to patents of the biomedical field remains rare, in particular to address the needs of the pharmaceutical & biotech industry, which intensively uses patent libraries for competitive intelligence and drug development.
We describe here the development of an advanced retrieval engine to search information in patent collections in the field of medicinal chemistry. We investigate and combine different strategies and evaluate their respective impact on the performance of the search engine applied to various search tasks, which covers the putatively most frequent search behaviours of intellectual property officers in medical chemistry: 1) a prior art search task; 2) a technical survey task; and 3) a variant of the technical survey task, sometimes called known-item search task, where a single patent is targeted.
The optimal tuning of our engine resulted in a top-precision of 6.76% for the prior art search task, 23.28% for the technical survey task and 46.02% for the variant of the technical survey task. We observed that co-citation boosting was an appropriate strategy to improve prior art search tasks, while IPC classification of queries was improving retrieval effectiveness for technical survey tasks. Surprisingly, the use of the full body of the patent was always detrimental for search effectiveness. It was also observed that normalizing biomedical entities using curated dictionaries had simply no impact on the search tasks we evaluate. The search engine was finally implemented as a web-application within Novartis Pharma. The application is briefly described in the report.
We have presented the development of a search engine dedicated to patent search, based on state of the art methods applied to patent corpora. We have shown that a proper tuning of the system to adapt to the various search tasks clearly increases the effectiveness of the system. We conclude that different search tasks demand different information retrieval engines' settings in order to yield optimal end-user retrieval.
PMCID: PMC4015144  PMID: 24564220
3.  CDAPubMed: a browser extension to retrieve EHR-based biomedical literature 
Over the last few decades, the ever-increasing output of scientific publications has led to new challenges to keep up to date with the literature. In the biomedical area, this growth has introduced new requirements for professionals, e.g., physicians, who have to locate the exact papers that they need for their clinical and research work amongst a huge number of publications. Against this backdrop, novel information retrieval methods are even more necessary. While web search engines are widespread in many areas, facilitating access to all kinds of information, additional tools are required to automatically link information retrieved from these engines to specific biomedical applications. In the case of clinical environments, this also means considering aspects such as patient data security and confidentiality or structured contents, e.g., electronic health records (EHRs). In this scenario, we have developed a new tool to facilitate query building to retrieve scientific literature related to EHRs.
We have developed CDAPubMed, an open-source web browser extension to integrate EHR features in biomedical literature retrieval approaches. Clinical users can use CDAPubMed to: (i) load patient clinical documents, i.e., EHRs based on the Health Level 7-Clinical Document Architecture Standard (HL7-CDA), (ii) identify relevant terms for scientific literature search in these documents, i.e., Medical Subject Headings (MeSH), automatically driven by the CDAPubMed configuration, which advanced users can optimize to adapt to each specific situation, and (iii) generate and launch literature search queries to a major search engine, i.e., PubMed, to retrieve citations related to the EHR under examination.
CDAPubMed is a platform-independent tool designed to facilitate literature searching using keywords contained in specific EHRs. CDAPubMed is visually integrated, as an extension of a widespread web browser, within the standard PubMed interface. It has been tested on a public dataset of HL7-CDA documents, returning significantly fewer citations since queries are focused on characteristics identified within the EHR. For instance, compared with more than 200,000 citations retrieved by breast neoplasm, fewer than ten citations were retrieved when ten patient features were added using CDAPubMed. This is an open source tool that can be freely used for non-profit purposes and integrated with other existing systems.
PMCID: PMC3366875  PMID: 22480327
4.  Information from Pharmaceutical Companies and the Quality, Quantity, and Cost of Physicians' Prescribing: A Systematic Review 
PLoS Medicine  2010;7(10):e1000352.
Geoff Spurling and colleagues report findings of a systematic review looking at the relationship between exposure to promotional material from pharmaceutical companies and the quality, quantity, and cost of prescribing. They fail to find evidence of improvements in prescribing after exposure, and find some evidence of an association with higher prescribing frequency, higher costs, or lower prescribing quality.
Pharmaceutical companies spent $57.5 billion on pharmaceutical promotion in the United States in 2004. The industry claims that promotion provides scientific and educational information to physicians. While some evidence indicates that promotion may adversely influence prescribing, physicians hold a wide range of views about pharmaceutical promotion. The objective of this review is to examine the relationship between exposure to information from pharmaceutical companies and the quality, quantity, and cost of physicians' prescribing.
Methods and Findings
We searched for studies of physicians with prescribing rights who were exposed to information from pharmaceutical companies (promotional or otherwise). Exposures included pharmaceutical sales representative visits, journal advertisements, attendance at pharmaceutical sponsored meetings, mailed information, prescribing software, and participation in sponsored clinical trials. The outcomes measured were quality, quantity, and cost of physicians' prescribing. We searched Medline (1966 to February 2008), International Pharmaceutical Abstracts (1970 to February 2008), Embase (1997 to February 2008), Current Contents (2001 to 2008), and Central (The Cochrane Library Issue 3, 2007) using the search terms developed with an expert librarian. Additionally, we reviewed reference lists and contacted experts and pharmaceutical companies for information. Randomized and observational studies evaluating information from pharmaceutical companies and measures of physicians' prescribing were independently appraised for methodological quality by two authors. Studies were excluded where insufficient study information precluded appraisal. The full text of 255 articles was retrieved from electronic databases (7,185 studies) and other sources (138 studies). Articles were then excluded because they did not fulfil inclusion criteria (179) or quality appraisal criteria (18), leaving 58 included studies with 87 distinct analyses. Data were extracted independently by two authors and a narrative synthesis performed following the MOOSE guidelines. Of the set of studies examining prescribing quality outcomes, five found associations between exposure to pharmaceutical company information and lower quality prescribing, four did not detect an association, and one found associations with lower and higher quality prescribing. 38 included studies found associations between exposure and higher frequency of prescribing and 13 did not detect an association. Five included studies found evidence for association with higher costs, four found no association, and one found an association with lower costs. The narrative synthesis finding of variable results was supported by a meta-analysis of studies of prescribing frequency that found significant heterogeneity. The observational nature of most included studies is the main limitation of this review.
With rare exceptions, studies of exposure to information provided directly by pharmaceutical companies have found associations with higher prescribing frequency, higher costs, or lower prescribing quality or have not found significant associations. We did not find evidence of net improvements in prescribing, but the available literature does not exclude the possibility that prescribing may sometimes be improved. Still, we recommend that practitioners follow the precautionary principle and thus avoid exposure to information from pharmaceutical companies.
Please see later in the article for the Editors' Summary
Editors' Summary
A prescription drug is a medication that can be supplied only with a written instruction (“prescription”) from a physician or other licensed healthcare professional. In 2009, 3.9 billion drug prescriptions were dispensed in the US alone and US pharmaceutical companies made US$300 billion in sales revenue. Every year, a large proportion of this revenue is spent on drug promotion. In 2004, for example, a quarter of US drug revenue was spent on pharmaceutical promotion. The pharmaceutical industry claims that drug promotion—visits from pharmaceutical sales representatives, advertisements in journals and prescribing software, sponsorship of meetings, mailed information—helps to inform and educate healthcare professionals about the risks and benefits of their products and thereby ensures that patients receive the best possible care. Physicians, however, hold a wide range of views about pharmaceutical promotion. Some see it as a useful and convenient source of information. Others deny that they are influenced by pharmaceutical company promotion but claim that it influences other physicians. Meanwhile, several professional organizations have called for tighter control of promotional activities because of fears that pharmaceutical promotion might encourage physicians to prescribe inappropriate or needlessly expensive drugs.
Why Was This Study Done?
But is there any evidence that pharmaceutical promotion adversely influences prescribing? Reviews of the research literature undertaken in 2000 and 2005 provide some evidence that drug promotion influences prescribing behavior. However, these reviews only partly assessed the relationship between information from pharmaceutical companies and prescribing costs and quality and are now out of date. In this study, therefore, the researchers undertake a systematic review (a study that uses predefined criteria to identify all the research on a given topic) to reexamine the relationship between exposure to information from pharmaceutical companies and the quality, quantity, and cost of physicians' prescribing.
What Did the Researchers Do and Find?
The researchers searched the literature for studies of licensed physicians who were exposed to promotional and other information from pharmaceutical companies. They identified 58 studies that included a measure of exposure to any type of information directly provided by pharmaceutical companies and a measure of physicians' prescribing behavior. They then undertook a “narrative synthesis,” a descriptive analysis of the data in these studies. Ten of the studies, they report, examined the relationship between exposure to pharmaceutical company information and prescribing quality (as judged, for example, by physician drug choices in response to clinical vignettes). All but one of these studies suggested that exposure to drug company information was associated with lower prescribing quality or no association was detected. In the 51 studies that examined the relationship between exposure to drug company information and prescribing frequency, exposure to information was associated with more frequent prescribing or no association was detected. Thus, for example, 17 out of 29 studies of the effect of pharmaceutical sales representatives' visits found an association between visits and increased prescribing; none found an association with less frequent prescribing. Finally, eight studies examined the relationship between exposure to pharmaceutical company information and prescribing costs. With one exception, these studies indicated that exposure to information was associated with a higher cost of prescribing or no association was detected. So, for example, one study found that physicians with low prescribing costs were more likely to have rarely or never read promotional mail or journal advertisements from pharmaceutical companies than physicians with high prescribing costs.
What Do These Findings Mean?
With rare exceptions, these findings suggest that exposure to pharmaceutical company information is associated with either no effect on physicians' prescribing behavior or with adverse affects (reduced quality, increased frequency, or increased costs). Because most of the studies included in the review were observational studies—the physicians in the studies were not randomly selected to receive or not receive drug company information—it is not possible to conclude that exposure to information actually causes any changes in physician behavior. Furthermore, although these findings provide no evidence for any net improvement in prescribing after exposure to pharmaceutical company information, the researchers note that it would be wrong to conclude that improvements do not sometimes happen. The findings support the case for reforms to reduce negative influence to prescribing from pharmaceutical promotion.
Additional Information
Please access these Web sites via the online version of this summary at
Wikipedia has pages on prescription drugs and on pharmaceutical marketing (note that Wikipedia is a free online encyclopedia that anyone can edit; available in several languages)
The UK General Medical Council provides guidelines on good practice in prescribing medicines
The US Food and Drug Administration provides information on prescription drugs and on its Bad Ad Program
Healthy Skepticism is an international nonprofit membership association that aims to improve health by reducing harm from misleading health information
The Drug Promotion Database was developed by the World Health Organization Department of Essential Drugs & Medicines Policy and Health Action International Europe to address unethical and inappropriate drug promotion
PMCID: PMC2957394  PMID: 20976098
5.  PolySearch: a web-based text mining system for extracting relationships between human diseases, genes, mutations, drugs and metabolites 
Nucleic Acids Research  2008;36(Web Server issue):W399-W405.
A particular challenge in biomedical text mining is to find ways of handling ‘comprehensive’ or ‘associative’ queries such as ‘Find all genes associated with breast cancer’. Given that many queries in genomics, proteomics or metabolomics involve these kind of comprehensive searches we believe that a web-based tool that could support these searches would be quite useful. In response to this need, we have developed the PolySearch web server. PolySearch supports >50 different classes of queries against nearly a dozen different types of text, scientific abstract or bioinformatic databases. The typical query supported by PolySearch is ‘Given X, find all Y's’ where X or Y can be diseases, tissues, cell compartments, gene/protein names, SNPs, mutations, drugs and metabolites. PolySearch also exploits a variety of techniques in text mining and information retrieval to identify, highlight and rank informative abstracts, paragraphs or sentences. PolySearch's performance has been assessed in tasks such as gene synonym identification, protein–protein interaction identification and disease gene identification using a variety of manually assembled ‘gold standard’ text corpuses. Its f-measure on these tasks is 88, 81 and 79%, respectively. These values are between 5 and 50% better than other published tools. The server is freely available at
PMCID: PMC2447794  PMID: 18487273
6.  Overview of the protein-protein interaction annotation extraction task of BioCreative II 
Genome Biology  2008;9(Suppl 2):S4.
The biomedical literature is the primary information source for manual protein-protein interaction annotations. Text-mining systems have been implemented to extract binary protein interactions from articles, but a comprehensive comparison between the different techniques as well as with manual curation was missing.
We designed a community challenge, the BioCreative II protein-protein interaction (PPI) task, based on the main steps of a manual protein interaction annotation workflow. It was structured into four distinct subtasks related to: (a) detection of protein interaction-relevant articles; (b) extraction and normalization of protein interaction pairs; (c) retrieval of the interaction detection methods used; and (d) retrieval of actual text passages that provide evidence for protein interactions. A total of 26 teams submitted runs for at least one of the proposed subtasks. In the interaction article detection subtask, the top scoring team reached an F-score of 0.78. In the interaction pair extraction and mapping to SwissProt, a precision of 0.37 (with recall of 0.33) was obtained. For associating articles with an experimental interaction detection method, an F-score of 0.65 was achieved. As for the retrieval of the PPI passages best summarizing a given protein interaction in full-text articles, 19% of the submissions returned by one of the runs corresponded to curator-selected sentences. Curators extracted only the passages that best summarized a given interaction, implying that many of the automatically extracted ones could contain interaction information but did not correspond to the most informative sentences.
The BioCreative II PPI task is the first attempt to compare the performance of text-mining tools specific for each of the basic steps of the PPI extraction pipeline. The challenges identified range from problems in full-text format conversion of articles to difficulties in detecting interactor protein pairs and then linking them to their database records. Some limitations were also encountered when using a single (and possibly incomplete) reference database for protein normalization or when limiting search for interactor proteins to co-occurrence within a single sentence, when a mention might span neighboring sentences. Finally, distinguishing between novel, experimentally verified interactions (annotation relevant) and previously known interactions adds additional complexity to these tasks.
PMCID: PMC2559988  PMID: 18834495
7.  The Human EST Ontology Explorer: a tissue-oriented visualization system for ontologies distribution in human EST collections 
BMC Bioinformatics  2009;10(Suppl 12):S2.
The NCBI dbEST currently contains more than eight million human Expressed Sequenced Tags (ESTs). This wide collection represents an important source of information for gene expression studies, provided it can be inspected according to biologically relevant criteria. EST data can be browsed using different dedicated web resources, which allow to investigate library specific gene expression levels and to make comparisons among libraries, highlighting significant differences in gene expression. Nonetheless, no tool is available to examine distributions of quantitative EST collections in Gene Ontology (GO) categories, nor to retrieve information concerning library-dependent EST involvement in metabolic pathways. In this work we present the Human EST Ontology Explorer (HEOE) , a web facility for comparison of expression levels among libraries from several healthy and diseased tissues.
The HEOE provides library-dependent statistics on the distribution of sequences in the GO Direct Acyclic Graph (DAG) that can be browsed at each GO hierarchical level. The tool is based on large-scale BLAST annotation of EST sequences. Due to the huge number of input sequences, this BLAST analysis was performed with the aid of grid computing technology, which is particularly suitable to address data parallel task. Relying on the achieved annotation, library-specific distributions of ESTs in the GO Graph were inferred. A pathway-based search interface was also implemented, for a quick evaluation of the representation of libraries in metabolic pathways. EST processing steps were integrated in a semi-automatic procedure that relies on Perl scripts and stores results in a MySQL database. A PHP-based web interface offers the possibility to simultaneously visualize, retrieve and compare data from the different libraries. Statistically significant differences in GO categories among user selected libraries can also be computed.
The HEOE provides an alternative and complementary way to inspect EST expression levels with respect to approaches currently offered by other resources. Furthermore, BLAST computation on the whole human EST dataset was a suitable test of grid scalability in the context of large-scale bioinformatics analysis. The HEOE currently comprises sequence analysis from 70 non-normalized libraries, representing a comprehensive overview on healthy and unhealthy tissues. As the analysis procedure can be easily applied to other libraries, the number of represented tissues is intended to increase.
PMCID: PMC2762067  PMID: 19828078
8.  A Topic Clustering Approach to Finding Similar Questions from Large Question and Answer Archives 
PLoS ONE  2014;9(3):e71511.
With the blooming of Web 2.0, Community Question Answering (CQA) services such as Yahoo! Answers (, WikiAnswer (, and Baidu Zhidao (, etc., have emerged as alternatives for knowledge and information acquisition. Over time, a large number of question and answer (Q&A) pairs with high quality devoted by human intelligence have been accumulated as a comprehensive knowledge base. Unlike the search engines, which return long lists of results, searching in the CQA services can obtain the correct answers to the question queries by automatically finding similar questions that have already been answered by other users. Hence, it greatly improves the efficiency of the online information retrieval. However, given a question query, finding the similar and well-answered questions is a non-trivial task. The main challenge is the word mismatch between question query (query) and candidate question for retrieval (question). To investigate this problem, in this study, we capture the word semantic similarity between query and question by introducing the topic modeling approach. We then propose an unsupervised machine-learning approach to finding similar questions on CQA Q&A archives. The experimental results show that our proposed approach significantly outperforms the state-of-the-art methods.
PMCID: PMC3942313  PMID: 24595052
9.  SciMiner: web-based literature mining tool for target identification and functional enrichment analysis 
Bioinformatics  2009;25(6):838-840.
Summary:SciMiner is a web-based literature mining and functional analysis tool that identifies genes and proteins using a context specific analysis of MEDLINE abstracts and full texts. SciMiner accepts a free text query (PubMed Entrez search) or a list of PubMed identifiers as input. SciMiner uses both regular expression patterns and dictionaries of gene symbols and names compiled from multiple sources. Ambiguous acronyms are resolved by a scoring scheme based on the co-occurrence of acronyms and corresponding description terms, which incorporates optional user-defined filters. Functional enrichment analyses are used to identify highly relevant targets (genes and proteins), GO (Gene Ontology) terms, MeSH (Medical Subject Headings) terms, pathways and protein–protein interaction networks by comparing identified targets from one search result with those from other searches or to the full HGNC [HUGO (Human Genome Organization) Gene Nomenclature Committee] gene set. The performance of gene/protein name identification was evaluated using the BioCreAtIvE (Critical Assessment of Information Extraction systems in Biology) version 2 (Year 2006) Gene Normalization Task as a gold standard. SciMiner achieved 87.1% recall, 71.3% precision and 75.8% F-measure. SciMiner's literature mining performance coupled with functional enrichment analyses provides an efficient platform for retrieval and summary of rich biological information from corpora of users' interests.
Availability: A server version of the SciMiner is also available for download and enables users to utilize their institution's journal subscriptions.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2654801  PMID: 19188191
10.  User centered and ontology based information retrieval system for life sciences 
BMC Bioinformatics  2012;13(Suppl 1):S4.
Because of the increasing number of electronic resources, designing efficient tools to retrieve and exploit them is a major challenge. Some improvements have been offered by semantic Web technologies and applications based on domain ontologies. In life science, for instance, the Gene Ontology is widely exploited in genomic applications and the Medical Subject Headings is the basis of biomedical publications indexation and information retrieval process proposed by PubMed. However current search engines suffer from two main drawbacks: there is limited user interaction with the list of retrieved resources and no explanation for their adequacy to the query is provided. Users may thus be confused by the selection and have no idea on how to adapt their queries so that the results match their expectations.
This paper describes an information retrieval system that relies on domain ontology to widen the set of relevant documents that is retrieved and that uses a graphical rendering of query results to favor user interactions. Semantic proximities between ontology concepts and aggregating models are used to assess documents adequacy with respect to a query. The selection of documents is displayed in a semantic map to provide graphical indications that make explicit to what extent they match the user's query; this man/machine interface favors a more interactive and iterative exploration of data corpus, by facilitating query concepts weighting and visual explanation. We illustrate the benefit of using this information retrieval system on two case studies one of which aiming at collecting human genes related to transcription factors involved in hemopoiesis pathway.
The ontology based information retrieval system described in this paper (OBIRS) is freely available at: This environment is a first step towards a user centred application in which the system enlightens relevant information to provide decision help.
PMCID: PMC3434427  PMID: 22373375
11.  Personalized online information search and visualization 
The rapid growth of online publications such as the Medline and other sources raises the questions how to get the relevant information efficiently. It is important, for a bench scientist, e.g., to monitor related publications constantly. It is also important, for a clinician, e.g., to access the patient records anywhere and anytime. Although time-consuming, this kind of searching procedure is usually similar and simple. Likely, it involves a search engine and a visualization interface. Different words or combination reflects different research topics. The objective of this study is to automate this tedious procedure by recording those words/terms in a database and online sources, and use the information for an automated search and retrieval. The retrieved information will be available anytime and anywhere through a secure web server.
We developed such a database that stored searching terms, journals and et al., and implement a piece of software for searching the medical subject heading-indexed sources such as the Medline and other online sources automatically. The returned information were stored locally, as is, on a server and visible through a Web-based interface. The search was performed daily or otherwise scheduled and the users logon to the website anytime without typing any words. The system has potentials to retrieve similarly from non-medical subject heading-indexed literature or a privileged information source such as a clinical information system. The issues such as security, presentation and visualization of the retrieved information were thus addressed. One of the presentation issues such as wireless access was also experimented. A user survey showed that the personalized online searches saved time and increased and relevancy. Handheld devices could also be used to access the stored information but less satisfactory.
The Web-searching software or similar system has potential to be an efficient tool for both bench scientists and clinicians for their daily information needs.
PMCID: PMC1079857  PMID: 15766382
12.  Extracting semantically enriched events from biomedical literature 
BMC Bioinformatics  2012;13:108.
Research into event-based text mining from the biomedical literature has been growing in popularity to facilitate the development of advanced biomedical text mining systems. Such technology permits advanced search, which goes beyond document or sentence-based retrieval. However, existing event-based systems typically ignore additional information within the textual context of events that can determine, amongst other things, whether an event represents a fact, hypothesis, experimental result or analysis of results, whether it describes new or previously reported knowledge, and whether it is speculated or negated. We refer to such contextual information as meta-knowledge. The automatic recognition of such information can permit the training of systems allowing finer-grained searching of events according to the meta-knowledge that is associated with them.
Based on a corpus of 1,000 MEDLINE abstracts, fully manually annotated with both events and associated meta-knowledge, we have constructed a machine learning-based system that automatically assigns meta-knowledge information to events. This system has been integrated into EventMine, a state-of-the-art event extraction system, in order to create a more advanced system (EventMine-MK) that not only extracts events from text automatically, but also assigns five different types of meta-knowledge to these events. The meta-knowledge assignment module of EventMine-MK performs with macro-averaged F-scores in the range of 57-87% on the BioNLP’09 Shared Task corpus. EventMine-MK has been evaluated on the BioNLP’09 Shared Task subtask of detecting negated and speculated events. Our results show that EventMine-MK can outperform other state-of-the-art systems that participated in this task.
We have constructed the first practical system that extracts both events and associated, detailed meta-knowledge information from biomedical literature. The automatically assigned meta-knowledge information can be used to refine search systems, in order to provide an extra search layer beyond entities and assertions, dealing with phenomena such as rhetorical intent, speculations, contradictions and negations. This finer grained search functionality can assist in several important tasks, e.g., database curation (by locating new experimental knowledge) and pathway enrichment (by providing information for inference). To allow easy integration into text mining systems, EventMine-MK is provided as a UIMA component that can be used in the interoperable text mining infrastructure, U-Compare.
PMCID: PMC3464657  PMID: 22621266
13.  P13-S Database Protein Information Searching Engine via Internet: PIKE 
One of the main goals in proteomics is to extract and collect all the functional information available in existing databases in relation to a defined set of identified proteins. Due to the huge amount of data available, it is not possible to gather up this information by hand; we need to have automatic methods for addressing this task.
Protein information and knowledge extractor (PIKE) solves this problem by accessing several public information systems and databases automatically through the Internet and retrieving all functional information available on the different repositories, and then clustering this information according to the pre-selected criteria. The PIKE bioinformatics tool, accessible through, uses the Java and XML languages. Starting with a selected group of identified proteins, listed as NCBI nr, uniprot, and/or ipi ( accession codes, PIKE retrieves all relevant information stored in databases by choosing the correct pathway and/or the best information source.
Once the search is done, a typical PIKE output shows a report table with an entry for each protein containing all extracted information. The report contains a large amount of meaningful protein features, such as (1) function information, (2) sub-cellular location, (3) tissue specificity, (4) links with other repositories, such as Mendelian Inheritance in Man (OMIM) or Kyoto Encyclopaedia of Genes and Genomes (KEGG), and (5) gene ontology tree classification. The table is exportable in CSV and text file formats, and, more important, it is possible to export it in PRIDE XML ( format for results integration into the information stored in other applications such as ProteinScape.
PMCID: PMC2292065
14.  Mining biomarker information in biomedical literature 
For selection and evaluation of potential biomarkers, inclusion of already published information is of utmost importance. In spite of significant advancements in text- and data-mining techniques, the vast knowledge space of biomarkers in biomedical text has remained unexplored. Existing named entity recognition approaches are not sufficiently selective for the retrieval of biomarker information from the literature. The purpose of this study was to identify textual features that enhance the effectiveness of biomarker information retrieval for different indication areas and diverse end user perspectives.
A biomarker terminology was created and further organized into six concept classes. Performance of this terminology was optimized towards balanced selectivity and specificity. The information retrieval performance using the biomarker terminology was evaluated based on various combinations of the terminology's six classes. Further validation of these results was performed on two independent corpora representing two different neurodegenerative diseases.
The current state of the biomarker terminology contains 119 entity classes supported by 1890 different synonyms. The result of information retrieval shows improved retrieval rate of informative abstracts, which is achieved by including clinical management terms and evidence of gene/protein alterations (e.g. gene/protein expression status or certain polymorphisms) in combination with disease and gene name recognition. When additional filtering through other classes (e.g. diagnostic or prognostic methods) is applied, the typical high number of unspecific search results is significantly reduced. The evaluation results suggest that this approach enables the automated identification of biomarker information in the literature. A demo version of the search engine SCAIView, including the biomarker retrieval, is made available to the public through
The approach presented in this paper demonstrates that using a dedicated biomarker terminology for automated analysis of the scientific literature maybe helpful as an aid to finding biomarker information in text. Successful extraction of candidate biomarkers information from published resources can be considered as the first step towards developing novel hypotheses. These hypotheses will be valuable for the early decision-making in the drug discovery and development process.
PMCID: PMC3541249  PMID: 23249606
Text-mining; Biomarker discovery; Information retrieval; Terminology
15.  Large-Scale Event Extraction from Literature with Multi-Level Gene Normalization 
PLoS ONE  2013;8(4):e55814.
Text mining for the life sciences aims to aid database curation, knowledge summarization and information retrieval through the automated processing of biomedical texts. To provide comprehensive coverage and enable full integration with existing biomolecular database records, it is crucial that text mining tools scale up to millions of articles and that their analyses can be unambiguously linked to information recorded in resources such as UniProt, KEGG, BioGRID and NCBI databases. In this study, we investigate how fully automated text mining of complex biomolecular events can be augmented with a normalization strategy that identifies biological concepts in text, mapping them to identifiers at varying levels of granularity, ranging from canonicalized symbols to unique gene and proteins and broad gene families. To this end, we have combined two state-of-the-art text mining components, previously evaluated on two community-wide challenges, and have extended and improved upon these methods by exploiting their complementary nature. Using these systems, we perform normalization and event extraction to create a large-scale resource that is publicly available, unique in semantic scope, and covers all 21.9 million PubMed abstracts and 460 thousand PubMed Central open access full-text articles. This dataset contains 40 million biomolecular events involving 76 million gene/protein mentions, linked to 122 thousand distinct genes from 5032 species across the full taxonomic tree. Detailed evaluations and analyses reveal promising results for application of this data in database and pathway curation efforts. The main software components used in this study are released under an open-source license. Further, the resulting dataset is freely accessible through a novel API, providing programmatic and customized access ( Finally, to allow for large-scale bioinformatic analyses, the entire resource is available for bulk download from, under the Creative Commons – Attribution – Share Alike (CC BY-SA) license.
PMCID: PMC3629104  PMID: 23613707
16.  BioCreative III interactive task: an overview 
BMC Bioinformatics  2011;12(Suppl 8):S4.
The BioCreative challenge evaluation is a community-wide effort for evaluating text mining and information extraction systems applied to the biological domain. The biocurator community, as an active user of biomedical literature, provides a diverse and engaged end user group for text mining tools. Earlier BioCreative challenges involved many text mining teams in developing basic capabilities relevant to biological curation, but they did not address the issues of system usage, insertion into the workflow and adoption by curators. Thus in BioCreative III (BC-III), the InterActive Task (IAT) was introduced to address the utility and usability of text mining tools for real-life biocuration tasks. To support the aims of the IAT in BC-III, involvement of both developers and end users was solicited, and the development of a user interface to address the tasks interactively was requested.
A User Advisory Group (UAG) actively participated in the IAT design and assessment. The task focused on gene normalization (identifying gene mentions in the article and linking these genes to standard database identifiers), gene ranking based on the overall importance of each gene mentioned in the article, and gene-oriented document retrieval (identifying full text papers relevant to a selected gene). Six systems participated and all processed and displayed the same set of articles. The articles were selected based on content known to be problematic for curation, such as ambiguity of gene names, coverage of multiple genes and species, or introduction of a new gene name. Members of the UAG curated three articles for training and assessment purposes, and each member was assigned a system to review. A questionnaire related to the interface usability and task performance (as measured by precision and recall) was answered after systems were used to curate articles. Although the limited number of articles analyzed and users involved in the IAT experiment precluded rigorous quantitative analysis of the results, a qualitative analysis provided valuable insight into some of the problems encountered by users when using the systems. The overall assessment indicates that the system usability features appealed to most users, but the system performance was suboptimal (mainly due to low accuracy in gene normalization). Some of the issues included failure of species identification and gene name ambiguity in the gene normalization task leading to an extensive list of gene identifiers to review, which, in some cases, did not contain the relevant genes. The document retrieval suffered from the same shortfalls. The UAG favored achieving high performance (measured by precision and recall), but strongly recommended the addition of features that facilitate the identification of correct gene and its identifier, such as contextual information to assist in disambiguation.
The IAT was an informative exercise that advanced the dialog between curators and developers and increased the appreciation of challenges faced by each group. A major conclusion was that the intended users should be actively involved in every phase of software development, and this will be strongly encouraged in future tasks. The IAT Task provides the first steps toward the definition of metrics and functional requirements that are necessary for designing a formal evaluation of interactive curation systems in the BioCreative IV challenge.
PMCID: PMC3269939  PMID: 22151968
17.  Google Scholar as replacement for systematic literature searches: good relative recall and precision are not enough 
Recent research indicates a high recall in Google Scholar searches for systematic reviews. These reports raised high expectations of Google Scholar as a unified and easy to use search interface. However, studies on the coverage of Google Scholar rarely used the search interface in a realistic approach but instead merely checked for the existence of gold standard references. In addition, the severe limitations of the Google Search interface must be taken into consideration when comparing with professional literature retrieval tools.
The objectives of this work are to measure the relative recall and precision of searches with Google Scholar under conditions which are derived from structured search procedures conventional in scientific literature retrieval; and to provide an overview of current advantages and disadvantages of the Google Scholar search interface in scientific literature retrieval.
General and MEDLINE-specific search strategies were retrieved from 14 Cochrane systematic reviews. Cochrane systematic review search strategies were translated to Google Scholar search expression as good as possible under consideration of the original search semantics. The references of the included studies from the Cochrane reviews were checked for their inclusion in the result sets of the Google Scholar searches. Relative recall and precision were calculated.
We investigated Cochrane reviews with a number of included references between 11 and 70 with a total of 396 references. The Google Scholar searches resulted in sets between 4,320 and 67,800 and a total of 291,190 hits. The relative recall of the Google Scholar searches had a minimum of 76.2% and a maximum of 100% (7 searches). The precision of the Google Scholar searches had a minimum of 0.05% and a maximum of 0.92%. The overall relative recall for all searches was 92.9%, the overall precision was 0.13%.
The reported relative recall must be interpreted with care. It is a quality indicator of Google Scholar confined to an experimental setting which is unavailable in systematic retrieval due to the severe limitations of the Google Scholar search interface. Currently, Google Scholar does not provide necessary elements for systematic scientific literature retrieval such as tools for incremental query optimization, export of a large number of references, a visual search builder or a history function. Google Scholar is not ready as a professional searching tool for tasks where structured retrieval methodology is necessary.
PMCID: PMC3840556  PMID: 24160679
Literature search; Literature search; Methods; Literature search; Systematic; Information storage and retrieval; Google Scholar; Medline
18.  COM3/369: Knowledge-based Information Systems: A new approach for the representation and retrieval of medical information 
Present solutions for the representation and retrieval of medical information from online sources are not very satisfying. Either the retrieval process lacks of precision and completeness the representation does not support the update and maintenance of the represented information. Most efforts are currently put into improving the combination of search engines and HTML based documents. However, due to the current shortcomings of methods for natural language understanding there are clear limitations to this approach. Furthermore, this approach does not solve the maintenance problem. At least medical information exceeding a certain complexity seems to afford approaches that rely on structured knowledge representation and corresponding retrieval mechanisms.
Knowledge-based information systems are based on the following fundamental ideas. The representation of information is based on ontologies that define the structure of the domain's concepts and their relations. Views on domain models are defined and represented as retrieval schemata. Retrieval schemata can be interpreted as canonical query types focussing on specific aspects of the provided information (e.g. diagnosis or therapy centred views). Based on these retrieval schemata it can be decided which parts of the information in the domain model must be represented explicitly and formalised to support the retrieval process. As representation language propositional logic is used. All other information can be represented in a structured but informal way using text, images etc. Layout schemata are used to assign layout information to retrieved domain concepts. Depending on the target environment HTML or XML can be used.
Based on this approach two knowledge-based information systems have been developed. The 'Ophthalmologic Knowledge-based Information System for Diabetic Retinopathy' (OKIS-DR) provides information on diagnoses, findings, examinations, guidelines, and reference images related to diabetic retinopathy. OKIS-DR uses combinations of findings to specify the information that must be retrieved. The second system focuses on nutrition related allergies and intolerances. Information on allergies and intolerances of a patient are used to retrieve general information on the specified combination of allergies and intolerances. As a special feature the system generates tables showing food types and products that are tolerated or not tolerated by patients. Evaluation by external experts and user groups showed that the described approach of knowledge-based information systems increases the precision and completeness of knowledge retrieval. Due to the structured and non-redundant representation of information the maintenance and update of the information can be simplified. Both systems are available as WWW based online knowledge bases and CD-ROMs (cf. topic: products).
PMCID: PMC1761778
Knowledge-based Information Systems; Knowledge-based Systems; Information Retrieval
19.  MED21/393: ICD as a Search Tool for Medical Internet Resources 
The Internet offers information on specific diseases on WWW-server distributed over the entire world. Selecting the appropriate information resource is a non-trivial task. Several approaches (like MeSH, MedPix) exist to describe the contents of Web pages. However, the majority of the Web sites do not make use of these schemas. Automatic (or semi-automatic) linking of medical records to information resources on the internet is not sufficiently supported. ICD (International Classification of Diseases) is the standard for coding the diagnosis in medical records. The purpose of this study was to evaluate the feasibility of ICD - based search tools.
A general ICD Meta-Search Engine (ICD-Search) was established to perform ICD-based Internet searches. From the records of the Großhadern University Hospital, the 20 most frequent diagnosis were selected. The diseases are coded with ICD-9 by the medical personnel of the hospital. The English version of the ICD-9 was chosen for performing the internet study. General search engines (Yahoo, Lycos, AltaVista) were used to retrieve medical information corresponding to the selected ICD codes. The first 50 hits were included in the study. The Web pages were scored according to accessibility (information could be retrieved) of the Web pages and the contents (reflects ICD code). The quality of information (current state of knowledge, complete information, comprehensive presentation) was not take into account. Also the results of the different search engines were compared. This study was repeated with specialized medical search engines (MedHunt) . Scoring of the results was performed accordingly.
Both Lycos and AltaVista searched resulted in a large number of hits. The range for AltaVista was between 114.440 (Coronary atherosclerosis) and 405.770 (Alcohol dependence syndrome). The Lycos search resulted in no categories and four sites (Coronary atherosclerosis) and one site for Alcohol dependence syndrome. MedHunt found 131 corresponding sites for Coronary atherosclerosis and 112 for Alcohol dependence syndrome. The median scores for Lycos and Altavista were in the same order with some extreme low scores for AltaVista on selected diseases. Lycos, in general, scored better than Lycos and AltaVista but show a low number of hits for several diseases. MedHunt produced both good scores and a sufficient number of hits. The results from Lycos and AltaVista were compared. It showed, that only 15% of the web pages were found by both search engines.
All search engines found Internet sites that corresponded to the selected ICD-codes. The standard vocabulary provided by ICD proved to be a good basis for linking medical diagnosis with Internet Web sites. Specialized medical search engines perform better than general ones. Search engines, that also give information on the quality of information, offer additional value.
PMCID: PMC1761824
Medical Informatics Applications; Medical Record Linkage
20.  Concept-based query expansion for retrieving gene related publications from MEDLINE 
BMC Bioinformatics  2010;11:212.
Advances in biotechnology and in high-throughput methods for gene analysis have contributed to an exponential increase in the number of scientific publications in these fields of study. While much of the data and results described in these articles are entered and annotated in the various existing biomedical databases, the scientific literature is still the major source of information. There is, therefore, a growing need for text mining and information retrieval tools to help researchers find the relevant articles for their study. To tackle this, several tools have been proposed to provide alternative solutions for specific user requests.
This paper presents QuExT, a new PubMed-based document retrieval and prioritization tool that, from a given list of genes, searches for the most relevant results from the literature. QuExT follows a concept-oriented query expansion methodology to find documents containing concepts related to the genes in the user input, such as protein and pathway names. The retrieved documents are ranked according to user-definable weights assigned to each concept class. By changing these weights, users can modify the ranking of the results in order to focus on documents dealing with a specific concept. The method's performance was evaluated using data from the 2004 TREC genomics track, producing a mean average precision of 0.425, with an average of 4.8 and 31.3 relevant documents within the top 10 and 100 retrieved abstracts, respectively.
QuExT implements a concept-based query expansion scheme that leverages gene-related information available on a variety of biological resources. The main advantage of the system is to give the user control over the ranking of the results by means of a simple weighting scheme. Using this approach, researchers can effortlessly explore the literature regarding a group of genes and focus on the different aspects relating to these genes.
PMCID: PMC2873540  PMID: 20426836
21.  Clinician Search Behaviors May Be Influenced by Search Engine Design 
Searching the Web for documents using information retrieval systems plays an important part in clinicians’ practice of evidence-based medicine. While much research focuses on the design of methods to retrieve documents, there has been little examination of the way different search engine capabilities influence clinician search behaviors.
Previous studies have shown that use of task-based search engines allows for faster searches with no loss of decision accuracy compared with resource-based engines. We hypothesized that changes in search behaviors may explain these differences.
In all, 75 clinicians (44 doctors and 31 clinical nurse consultants) were randomized to use either a resource-based or a task-based version of a clinical information retrieval system to answer questions about 8 clinical scenarios in a controlled setting in a university computer laboratory. Clinicians using the resource-based system could select 1 of 6 resources, such as PubMed; clinicians using the task-based system could select 1 of 6 clinical tasks, such as diagnosis. Clinicians in both systems could reformulate search queries. System logs unobtrusively capturing clinicians’ interactions with the systems were coded and analyzed for clinicians’ search actions and query reformulation strategies.
The most frequent search action of clinicians using the resource-based system was to explore a new resource with the same query, that is, these clinicians exhibited a “breadth-first” search behaviour. Of 1398 search actions, clinicians using the resource-based system conducted 401 (28.7%, 95% confidence interval [CI] 26.37-31.11) in this way. In contrast, the majority of clinicians using the task-based system exhibited a “depth-first” search behavior in which they reformulated query keywords while keeping to the same task profiles. Of 585 search actions conducted by clinicians using the task-based system, 379 (64.8%, 95% CI 60.83-68.55) were conducted in this way.
This study provides evidence that different search engine designs are associated with different user search behaviors.
PMCID: PMC2956236  PMID: 20601351
Clinician; search behavior; information retrieval; Internet
22.  A Framework and Methodology for Navigating Disaster and Global Health in Crisis Literature 
PLoS Currents  2013;5:ecurrents.dis.9af6948e381dafdd3e877c441527cba0.
Both ‘disasters’ and ‘global health in crisis’ research has dramatically grown due to the ever-increasing frequency and magnitude of crises around the world. Large volumes of peer-reviewed literature are not only a testament to the field’s value and evolution, but also present an unprecedented outpouring of seemingly unmanageable information across a wide array of crises and disciplines. Disaster medicine, health and humanitarian assistance, global health and public health disaster literature all lie within the disaster and global health in crisis literature spectrum and are increasingly accepted as multidisciplinary and transdisciplinary disciplines. Researchers, policy makers, and practitioners now face a new challenge; that of accessing this expansive literature for decision-making and exploring new areas of research. Individuals are also reaching beyond the peer-reviewed environment to grey literature using search engines like Google Scholar to access policy documents, consensus reports and conference proceedings. What is needed is a method and mechanism with which to search and retrieve relevant articles from this expansive body of literature. This manuscript presents both a framework and workable process for a diverse group of users to navigate the growing peer-reviewed and grey disaster and global health in crises literature. Methods: Disaster terms from textbooks, peer-reviewed and grey literature were used to design a framework of thematic clusters and subject matter ‘nodes’. A set of 84 terms, selected from 143 curated terms was organized within each node reflecting topics within the disaster and global health in crisis literature. Terms were crossed with one another and the term ‘disaster’. The results were formatted into tables and matrices. This process created a roadmap of search terms that could be applied to the PubMed database. Each search in the matrix or table results in a listed number of articles. This process was applied to literature from PubMed from 2005-2011. A complementary process was also applied to Google Scholar using the same framework of clusters, nodes, and terms expanding the search process to include the broader grey literature assets. Results: A framework of four thematic clusters and twelve subject matter nodes were designed to capture diverse disaster and global health in crisis-related content. From 2005-2011 there were 18,660 articles referring to the term [disaster]. Restricting the search to human research, MeSH, and English language there remained 7,736 identified articles representing an unmanageable number to adequately process for research, policy or best practices. However, using the crossed search and matrix process revealed further examples of robust realms of research in disasters, emergency medicine, EMS, public health and global health. Examples of potential gaps in current peer-reviewed disaster and global health in crisis literature were identified as mental health, elderly care, and alternate sites of care. The same framework and process was then applied to Google Scholar, specifically for topics that resulted in few PubMed search returns. When applying the same framework and process to the Google Scholar example searches retrieved unique peer-reviewed articles not identified in PubMed and documents including books, governmental documents and consensus papers. Conclusions: The proposed framework, methodology and process using four clusters, twelve nodes and a matrix and table process applied to PubMed and Google Scholar unlocks otherwise inaccessible opportunities to better navigate the massively growing body of peer-reviewed disaster and global health in crises literature. This approach will assist researchers, policy makers, and practitioners to generate future research questions, report on the overall evolution of the disaster and global health in crisis field and further guide disaster planning, prevention, preparedness, mitigation response and recovery.
PMCID: PMC3625621  PMID: 23591457
23.  An evaluation of GO annotation retrieval for BioCreAtIvE and GOA 
BMC Bioinformatics  2005;6(Suppl 1):S17.
The Gene Ontology Annotation (GOA) database aims to provide high-quality supplementary GO annotation to proteins in the UniProt Knowledgebase. Like many other biological databases, GOA gathers much of its content from the careful manual curation of literature. However, as both the volume of literature and of proteins requiring characterization increases, the manual processing capability can become overloaded.
Consequently, semi-automated aids are often employed to expedite the curation process. Traditionally, electronic techniques in GOA depend largely on exploiting the knowledge in existing resources such as InterPro. However, in recent years, text mining has been hailed as a potentially useful tool to aid the curation process.
To encourage the development of such tools, the GOA team at EBI agreed to take part in the functional annotation task of the BioCreAtIvE (Critical Assessment of Information Extraction systems in Biology) challenge.
BioCreAtIvE task 2 was an experiment to test if automatically derived classification using information retrieval and extraction could assist expert biologists in the annotation of the GO vocabulary to the proteins in the UniProt Knowledgebase.
GOA provided the training corpus of over 9000 manual GO annotations extracted from the literature. For the test set, we provided a corpus of 200 new Journal of Biological Chemistry articles used to annotate 286 human proteins with GO terms. A team of experts manually evaluated the results of 9 participating groups, each of which provided highlighted sentences to support their GO and protein annotation predictions. Here, we give a biological perspective on the evaluation, explain how we annotate GO using literature and offer some suggestions to improve the precision of future text-retrieval and extraction techniques. Finally, we provide the results of the first inter-annotator agreement study for manual GO curation, as well as an assessment of our current electronic GO annotation strategies.
The GOA database currently extracts GO annotation from the literature with 91 to 100% precision, and at least 72% recall. This creates a particularly high threshold for text mining systems which in BioCreAtIvE task 2 (GO annotation extraction and retrieval) initial results precisely predicted GO terms only 10 to 20% of the time.
Improvements in the performance and accuracy of text mining for GO terms should be expected in the next BioCreAtIvE challenge. In the meantime the manual and electronic GO annotation strategies already employed by GOA will provide high quality annotations.
PMCID: PMC1869009  PMID: 15960829
24.  PLAN2L: a web tool for integrated text mining and literature-based bioentity relation extraction 
Nucleic Acids Research  2009;37(Web Server issue):W160-W165.
There is an increasing interest in using literature mining techniques to complement information extracted from annotation databases or generated by bioinformatics applications. Here we present PLAN2L, a web-based online search system that integrates text mining and information extraction techniques to access systematically information useful for analyzing genetic, cellular and molecular aspects of the plant model organism Arabidopsis thaliana. Our system facilitates a more efficient retrieval of information relevant to heterogeneous biological topics, from implications in biological relationships at the level of protein interactions and gene regulation, to sub-cellular locations of gene products and associations to cellular and developmental processes, i.e. cell cycle, flowering, root, leaf and seed development. Beyond single entities, also predefined pairs of entities can be provided as queries for which literature-derived relations together with textual evidences are returned. PLAN2L does not require registration and is freely accessible at
PMCID: PMC2703909  PMID: 19520768
25.  Evaluation of BioCreAtIvE assessment of task 2 
BMC Bioinformatics  2005;6(Suppl 1):S16.
Molecular Biology accumulated substantial amounts of data concerning functions of genes and proteins. Information relating to functional descriptions is generally extracted manually from textual data and stored in biological databases to build up annotations for large collections of gene products. Those annotation databases are crucial for the interpretation of large scale analysis approaches using bioinformatics or experimental techniques. Due to the growing accumulation of functional descriptions in biomedical literature the need for text mining tools to facilitate the extraction of such annotations is urgent. In order to make text mining tools useable in real world scenarios, for instance to assist database curators during annotation of protein function, comparisons and evaluations of different approaches on full text articles are needed.
The Critical Assessment for Information Extraction in Biology (BioCreAtIvE) contest consists of a community wide competition aiming to evaluate different strategies for text mining tools, as applied to biomedical literature. We report on task two which addressed the automatic extraction and assignment of Gene Ontology (GO) annotations of human proteins, using full text articles. The predictions of task 2 are based on triplets of protein – GO term – article passage. The annotation-relevant text passages were returned by the participants and evaluated by expert curators of the GO annotation (GOA) team at the European Institute of Bioinformatics (EBI). Each participant could submit up to three results for each sub-task comprising task 2. In total more than 15,000 individual results were provided by the participants. The curators evaluated in addition to the annotation itself, whether the protein and the GO term were correctly predicted and traceable through the submitted text fragment.
Concepts provided by GO are currently the most extended set of terms used for annotating gene products, thus they were explored to assess how effectively text mining tools are able to extract those annotations automatically. Although the obtained results are promising, they are still far from reaching the required performance demanded by real world applications. Among the principal difficulties encountered to address the proposed task, were the complex nature of the GO terms and protein names (the large range of variants which are used to express proteins and especially GO terms in free text), and the lack of a standard training set. A range of very different strategies were used to tackle this task. The dataset generated in line with the BioCreative challenge is publicly available and will allow new possibilities for training information extraction methods in the domain of molecular biology.
PMCID: PMC1869008  PMID: 15960828

Results 1-25 (606625)