Summary: FACTA is a text search engine for MEDLINE abstracts, which is designed particularly to help users browse biomedical concepts (e.g. genes/proteins, diseases, enzymes and chemical compounds) appearing in the documents retrieved by the query. The concepts are presented to the user in a tabular format and ranked based on the co-occurrence statistics. Unlike existing systems that provide similar functionality, FACTA pre-indexes not only the words but also the concepts mentioned in the documents, which enables the user to issue a flexible query (e.g. free keywords or Boolean combinations of keywords/concepts) and receive the results immediately even when the number of the documents that match the query is very large. The user can also view snippets from MEDLINE to get textual evidence of associations between the query terms and the concepts. The concept IDs and their names/synonyms for building the indexes were collected from several biomedical databases and thesauri, such as UniProt, BioThesaurus, UMLS, KEGG and DrugBank.
Availability: The system is available at http://www.nactem.ac.uk/software/facta/
Research results are primarily published in scientific literature and curation efforts cannot keep up with the rapid growth of published literature. The plethora of knowledge remains hidden in large text repositories like MEDLINE. Consequently, life scientists have to spend a great amount of time searching for specific information. The enormous ambiguity among most names of biomedical objects such as genes, chemicals and diseases often produces too large and unspecific search results. We present GeneView, a semantic search engine for biomedical knowledge. GeneView is built upon a comprehensively annotated version of PubMed abstracts and openly available PubMed Central full texts. This semi-structured representation of biomedical texts enables a number of features extending classical search engines. For instance, users may search for entities using unique database identifiers or they may rank documents by the number of specific mentions they contain. Annotation is performed by a multitude of state-of-the-art text-mining tools for recognizing mentions from 10 entity classes and for identifying protein–protein interactions. GeneView currently contains annotations for >194 million entities from 10 classes for ∼21 million citations with 271 000 full text bodies. GeneView can be searched at http://bc3.informatik.hu-berlin.de/.
Motivation: Although the amount of data in biology is rapidly increasing, critical information for understanding biological events like phosphorylation or gene expression remains locked in the biomedical literature. Most current text mining (TM) approaches to extract information about biological events are focused on either limited-scale studies and/or abstracts, with data extracted lacking context and rarely available to support further research.
Results: Here we present BioContext, an integrated TM system which extracts, extends and integrates results from a number of tools performing entity recognition, biomolecular event extraction and contextualization. Application of our system to 10.9 million MEDLINE abstracts and 234 000 open-access full-text articles from PubMed Central yielded over 36 million mentions representing 11.4 million distinct events. Event participants included over 290 000 distinct genes/proteins that are mentioned more than 80 million times and linked where possible to Entrez Gene identifiers. Over a third of events contain contextual information such as the anatomical location of the event occurrence or whether the event is reported as negated or speculative.
Availability: The BioContext pipeline is available for download (under the BSD license) at http://www.biocontext.org, along with the extracted data which is also available for online browsing.
Supplementary data are available at Bioinformatics online.
The scientific literature represents a rich source for retrieval of knowledge on associations between biomedical concepts such as genes, diseases and cellular processes. A commonly used method to establish relationships between biomedical concepts from literature is co-occurrence. Apart from its use in knowledge retrieval, the co-occurrence method is also well-suited to discover new, hidden relationships between biomedical concepts following a simple ABC-principle, in which A and C have no direct relationship, but are connected via shared B-intermediates. In this paper we describe CoPub Discovery, a tool that mines the literature for new relationships between biomedical concepts. Statistical analysis using ROC curves showed that CoPub Discovery performed well over a wide range of settings and keyword thesauri. We subsequently used CoPub Discovery to search for new relationships between genes, drugs, pathways and diseases. Several of the newly found relationships were validated using independent literature sources. In addition, new predicted relationships between compounds and cell proliferation were validated and confirmed experimentally in an in vitro cell proliferation assay. The results show that CoPub Discovery is able to identify novel associations between genes, drugs, pathways and diseases that have a high probability of being biologically valid. This makes CoPub Discovery a useful tool to unravel the mechanisms behind disease, to find novel drug targets, or to find novel applications for existing drugs.
The biomedical literature is an important source of knowledge on the function of genes and on the mechanisms by which these genes regulate cellular processes. Several text mining approaches have been developed to leverage this rich source of information by automatically extracting associations between concepts such as genes, diseases and drugs from a large body of text. Here, we describe a new method that extracts novel, not yet recognized associations between genes, diseases, drugs and cellular processes from the biomedical literature. Our method is built on the assumption that even if two concepts do not have a direct connection in literature, they may be functionally related if they are both connected to an overlapping set of concepts. Using this approach we predicted several novel connections between genes, diseases, drugs and pathways. Our results imply that our method is able to predict novel relationships from literature and, most importantly, that these newly identified relationships are biologically relevant. Our method can aid the drug discovery process where it can be used to find novel drug targets, increase insight in mode of action of a drug or find novel applications for known drugs.
Manually curating knowledge from biomedical literature into structured databases is highly expensive and time-consuming, making it difficult to keep pace with the rapid growth of the literature. There is therefore a pressing need to assist biocuration with automated text mining tools. Here, we describe PubTator, a web-based system for assisting biocuration. PubTator is different from the few existing tools by featuring a PubMed-like interface, which many biocurators find familiar, and being equipped with multiple challenge-winning text mining algorithms to ensure the quality of its automatic results. Through a formal evaluation with two external user groups, PubTator was shown to be capable of improving both the efficiency and accuracy of manual curation. PubTator is publicly available at http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/PubTator/.
The biomedical literature is represented by millions of abstracts available in the Medline database. These abstracts can be queried with the PubMed interface, which provides a keyword-based Boolean search engine. This approach shows limitations in the retrieval of abstracts related to very specific topics, as it is difficult for a non-expert user to find all of the most relevant keywords related to a biomedical topic. Additionally, when searching for more general topics, the same approach may return hundreds of unranked references. To address these issues, text mining tools have been developed to help scientists focus on relevant abstracts. We have implemented the MedlineRanker webserver, which allows a flexible ranking of Medline for a topic of interest without expert knowledge. Given some abstracts related to a topic, the program deduces automatically the most discriminative words in comparison to a random selection. These words are used to score other abstracts, including those from not yet annotated recent publications, which can be then ranked by relevance. We show that our tool can be highly accurate and that it is able to process millions of abstracts in a practical amount of time. MedlineRanker is free for use and is available at http://cbdm.mdc-berlin.de/tools/medlineranker.
MEDLINE®/PubMed® currently indexes over 18 million biomedical articles, providing unprecedented opportunities and challenges for text analysis. Using Medical Subject Heading Over-representation Profiles (MeSHOPs), an entity of interest can be robustly summarized, quantitatively identifying associated biomedical terms and predicting novel indirect associations.
A procedure is introduced for quantitative comparison of MeSHOPs derived from a group of MEDLINE® articles for a biomedical topic (for example, articles for a specific gene or disease). Similarity scores are computed to compare MeSHOPs of genes and diseases.
Similarity scores successfully infer novel associations between diseases and genes. The number of papers addressing a gene or disease has a strong influence on predicted associations, revealing an important bias for gene-disease relationship prediction. Predictions derived from comparisons of MeSHOPs achieves a mean 8% AUC improvement in the identification of gene-disease relationships compared to gene-independent baseline properties.
MeSHOP comparisons are demonstrated to provide predictive capacity for novel relationships between genes and human diseases. We demonstrate the impact of literature bias on the performance of gene-disease prediction methods. MeSHOPs provide a rich source of annotation to facilitate relationship discovery in biomedical informatics.
An increase in work on the full text of journal articles and the growth of PubMedCentral have the opportunity to create a major paradigm shift in how biomedical text mining is done. However, until now there has been no comprehensive characterization of how the bodies of full text journal articles differ from the abstracts that until now have been the subject of most biomedical text mining research.
We examined the structural and linguistic aspects of abstracts and bodies of full text articles, the performance of text mining tools on both, and the distribution of a variety of semantic classes of named entities between them. We found marked structural differences, with longer sentences in the article bodies and much heavier use of parenthesized material in the bodies than in the abstracts. We found content differences with respect to linguistic features. Three out of four of the linguistic features that we examined were statistically significantly differently distributed between the two genres. We also found content differences with respect to the distribution of semantic features. There were significantly different densities per thousand words for three out of four semantic classes, and clear differences in the extent to which they appeared in the two genres. With respect to the performance of text mining tools, we found that a mutation finder performed equally well in both genres, but that a wide variety of gene mention systems performed much worse on article bodies than they did on abstracts. POS tagging was also more accurate in abstracts than in article bodies.
Aspects of structure and content differ markedly between article abstracts and article bodies. A number of these differences may pose problems as the text mining field moves more into the area of processing full-text articles. However, these differences also present a number of opportunities for the extraction of data types, particularly that found in parenthesized text, that is present in article bodies but not in article abstracts.
Summary: Often, the most informative genes have to be selected from different gene sets and several computer gene ranking algorithms have been developed to cope with the problem. To help researchers decide which algorithm to use, we developed the analysis of gene ranking algorithms (AGRA) system that offers a novel technique for comparing ranked lists of genes. The most important feature of AGRA is that no previous knowledge of gene ranking algorithms is needed for their comparison. Using the text mining system finding-associated concepts with text analysis. AGRA defines what we call biomedical concept space (BCS) for each gene list and offers a comparison of the gene lists in six different BCS categories. The uploaded gene lists can be compared using two different methods. In the first method, the overlap between each pair of two gene lists of BCSs is calculated. The second method offers a text field where a specific biomedical concept can be entered. AGRA searches for this concept in each gene lists' BCS, highlights the rank of the concept and offers a visual representation of concepts ranked above and below it.
Availability and Implementation: Available at http://agra.fzv.uni-mb.si/, implemented in Java and running on the Glassfish server.
A major problem faced in biomedical informatics involves how best to present information retrieval results. When a single query retrieves many results, simply showing them as a long list often provides poor overview. With a goal of presenting users with reduced sets of relevant citations, this study developed an approach that retrieved and organized MEDLINE citations into different topical groups and prioritized important citations in each group.
A text mining system framework for automatic document clustering and ranking organized MEDLINE citations following simple PubMed queries. The system grouped the retrieved citations, ranked the citations in each cluster, and generated a set of keywords and MeSH terms to describe the common theme of each cluster.
Several possible ranking functions were compared, including citation count per year (CCPY), citation count (CC), and journal impact factor (JIF). We evaluated this framework by identifying as “important” those articles selected by the Surgical Oncology Society.
Our results showed that CCPY outperforms CC and JIF, i.e., CCPY better ranked important articles than did the others. Furthermore, our text clustering and knowledge extraction strategy grouped the retrieval results into informative clusters as revealed by the keywords and MeSH terms extracted from the documents in each cluster.
The text mining system studied effectively integrated text clustering, text summarization, and text ranking and organized MEDLINE retrieval results into different topical groups.
Motivation: A major goal of biomedical research in personalized medicine is to find relationships between mutations and their corresponding disease phenotypes. However, most of the disease-related mutational data are currently buried in the biomedical literature in textual form and lack the necessary structure to allow easy retrieval and visualization. We introduce a high-throughput computational method for the identification of relevant disease mutations in PubMed abstracts applied to prostate (PCa) and breast cancer (BCa) mutations.
Results: We developed the extractor of mutations (EMU) tool to identify mutations and their associated genes. We benchmarked EMU against MutationFinder—a tool to extract point mutations from text. Our results show that both methods achieve comparable performance on two manually curated datasets. We also benchmarked EMU's performance for extracting the complete mutational information and phenotype. Remarkably, we show that one of the steps in our approach, a filter based on sequence analysis, increases the precision for that task from 0.34 to 0.59 (PCa) and from 0.39 to 0.61 (BCa). We also show that this high-throughput approach can be extended to other diseases.
Discussion: Our method improves the current status of disease-mutation databases by significantly increasing the number of annotated mutations. We found 51 and 128 mutations manually verified to be related to PCa and Bca, respectively, that are not currently annotated for these cancer types in the OMIM or Swiss-Prot databases. EMU's retrieval performance represents a 2-fold improvement in the number of annotated mutations for PCa and BCa. We further show that our method can benefit from full-text analysis once there is an increase in Open Access availability of full-text articles.
Availability: Freely available at: http://bioinf.umbc.edu/EMU/ftp.
Supplementary information: Supplementary data are available at Bioinformatics online.
Advances in biotechnology and in high-throughput methods for gene analysis have contributed to an exponential increase in the number of scientific publications in these fields of study. While much of the data and results described in these articles are entered and annotated in the various existing biomedical databases, the scientific literature is still the major source of information. There is, therefore, a growing need for text mining and information retrieval tools to help researchers find the relevant articles for their study. To tackle this, several tools have been proposed to provide alternative solutions for specific user requests.
This paper presents QuExT, a new PubMed-based document retrieval and prioritization tool that, from a given list of genes, searches for the most relevant results from the literature. QuExT follows a concept-oriented query expansion methodology to find documents containing concepts related to the genes in the user input, such as protein and pathway names. The retrieved documents are ranked according to user-definable weights assigned to each concept class. By changing these weights, users can modify the ranking of the results in order to focus on documents dealing with a specific concept. The method's performance was evaluated using data from the 2004 TREC genomics track, producing a mean average precision of 0.425, with an average of 4.8 and 31.3 relevant documents within the top 10 and 100 retrieved abstracts, respectively.
QuExT implements a concept-based query expansion scheme that leverages gene-related information available on a variety of biological resources. The main advantage of the system is to give the user control over the ranking of the results by means of a simple weighting scheme. Using this approach, researchers can effortlessly explore the literature regarding a group of genes and focus on the different aspects relating to these genes.
Sickle cell disease (SCD) is a fatal monogenic disorder with no effective cure and thus high rates of morbidity and sequelae. Efforts toward discovery of disease modifying drugs and curative strategies can be augmented by leveraging the plethora of information contained in available biomedical literature. To facilitate research in this direction we have developed a resource, Dragon Exploration System for Sickle Cell Disease (DESSCD) (http://cbrc.kaust.edu.sa/desscd/) that aims to promote the easy exploration of SCD-related data.
The Dragon Exploration System (DES), developed based on text mining and complemented by data mining, processed 419,612 MEDLINE abstracts retrieved from a PubMed query using SCD-related keywords. The processed SCD-related data has been made available via the DESSCD web query interface that enables: a/information retrieval using specified concepts, keywords and phrases, and b/the generation of inferred association networks and hypotheses. The usefulness of the system is demonstrated by: a/reproducing a known scientific fact, the “Sickle_Cell_Anemia–Hydroxyurea” association, and b/generating novel and plausible “Sickle_Cell_Anemia–Hydroxyfasudil” hypothesis. A PCT patent (PCT/US12/55042) has been filed for the latter drug repurposing for SCD treatment.
We developed the DESSCD resource dedicated to exploration of text-mined and data-mined information about SCD. No similar SCD-related resource exists. Thus, we anticipate that DESSCD will serve as a valuable tool for physicians and researchers interested in SCD.
One important concept in traditional Chinese medicine (TCM) is "treating different diseases with the same therapy". In TCM practice, some patients with Rheumatoid Arthritis (RA) and some other patients with Coronary Heart Disease (CHD) can be treated with similar therapies. This suggests that there might be something commonly existed between RA and CHD, for example, biological networks or biological basis. As the amount of biomedical data in leading databases (i.e., PubMed, SinoMed, etc.) is growing at an exponential rate, it might be possible to get something interesting and meaningful through the techniques developed in data mining.
Based on the large data sets of Western medicine literature (PubMed) and traditional Chinese medicine literature (SinoMed), by applying data slicing algorithm in text mining, we retrieved some simple and meaningful networks. The Chinese herbs used in treatment of both RA and CHD, might affect the commonly existed networks between RA and CHD. This might support the TCM concept of treating different diseases with the same therapy.
First, the data mining results might show the positive answer that there are biological basis/networks commonly existed in both RA and CHD. Second, there are basic Chinese herbs used in the treatment of both RA and CHD. Third, these commonly existed networks might be affected by the basic Chinese herbs. Forth, discrete derivative, the data slicing algorithm is feasible in mining out useful data from literature of PubMed and SinoMed.
One of the most neglected areas of biomedical Text Mining (TM) is the development of systems based on carefully assessed user needs. We have recently investigated the user needs of an important task yet to be tackled by TM -- Cancer Risk Assessment (CRA). Here we take the first step towards the development of TM technology for the task: identifying and organizing the scientific evidence required for CRA in a taxonomy which is capable of supporting extensive data gathering from biomedical literature.
The taxonomy is based on expert annotation of 1297 abstracts downloaded from relevant PubMed journals. It classifies 1742 unique keywords found in the corpus to 48 classes which specify core evidence required for CRA. We report promising results with inter-annotator agreement tests and automatic classification of PubMed abstracts to taxonomy classes. A simple user test is also reported in a near real-world CRA scenario which demonstrates along with other evaluation that the resources we have built are well-defined, accurate, and applicable in practice.
We present our annotation guidelines and a tool which we have designed for expert annotation of PubMed abstracts. A corpus annotated for keywords and document relevance is also presented, along with the taxonomy which organizes the keywords into classes defining core evidence for CRA. As demonstrated by the evaluation, the materials we have constructed provide a good basis for classification of CRA literature along multiple dimensions. They can support current manual CRA as well as facilitate the development of an approach based on TM. We discuss extending the taxonomy further via manual and machine learning approaches and the subsequent steps required to develop TM technology for the needs of CRA.
Today’s biomedical research has become heavily dependent on access to the biological knowledge encoded in expert curated biological databases. As the volume of biological literature grows rapidly, it becomes increasingly difficult for biocurators to keep up with the literature because manual curation is an expensive and time-consuming endeavour. Past research has suggested that computer-assisted curation can improve efficiency, but few text-mining systems have been formally evaluated in this regard. Through participation in the interactive text-mining track of the BioCreative 2012 workshop, we developed PubTator, a PubMed-like system that assists with two specific human curation tasks: document triage and bioconcept annotation. On the basis of evaluation results from two external user groups, we find that the accuracy of PubTator-assisted curation is comparable with that of manual curation and that PubTator can significantly increase human curatorial speed. These encouraging findings warrant further investigation with a larger number of publications to be annotated.
Motivation: There has recently been a notable shift in biomedical information extraction (IE) from relation models toward the more expressive event model, facilitated by the maturation of basic tools for biomedical text analysis and the availability of manually annotated resources. The event model allows detailed representation of complex natural language statements and can support a number of advanced text mining applications ranging from semantic search to pathway extraction. A recent collaborative evaluation demonstrated the potential of event extraction systems, yet there have so far been no studies of the generalization ability of the systems nor the feasibility of large-scale extraction.
Results: This study considers event-based IE at PubMed scale. We introduce a system combining publicly available, state-of-the-art methods for domain parsing, named entity recognition and event extraction, and test the system on a representative 1% sample of all PubMed citations. We present the first evaluation of the generalization performance of event extraction systems to this scale and show that despite its computational complexity, event extraction from the entire PubMed is feasible. We further illustrate the value of the extraction approach through a number of analyses of the extracted information.
Availability: The event detection system and extracted data are open source licensed and available at http://bionlp.utu.fi/.
MEDLINE®/PubMed® indexes over 20 million biomedical articles, providing curated annotation of its contents using a controlled vocabulary known as Medical Subject Headings (MeSH). The MeSH vocabulary, developed over 50+ years, provides a broad coverage of topics across biomedical research. Distilling the essential biomedical themes for a topic of interest from the relevant literature is important to both understand the importance of related concepts and discover new relationships.
We introduce a novel method for determining enriched curator-assigned MeSH annotations in a set of papers associated to a topic, such as a gene, an author or a disease. We generate MeSH Over-representation Profiles (MeSHOPs) to quantitatively summarize the annotations in a form convenient for further computational analysis and visualization. Based on a hypergeometric distribution of assigned terms, MeSHOPs statistically account for the prevalence of the associated biomedical annotation while highlighting unusually prevalent terms based on a specified background. MeSHOPs can be visualized using word clouds, providing a succinct quantitative graphical representation of the relative importance of terms. Using the publication dates of articles, MeSHOPs track changing patterns of annotation over time. Since MeSHOPs are quantitative vectors, MeSHOPs can be compared using standard techniques such as hierarchical clustering. The reliability of MeSHOP annotations is assessed based on the capacity to re-derive the subset of the Gene Ontology annotations with equivalent MeSH terms.
MeSHOPs allows quantitative measurement of the degree of association between any entity and the annotated medical concepts, based directly on relevant primary literature. Comparison of MeSHOPs allows entities to be related based on shared medical themes in their literature. A web interface is provided for generating and visualizing MeSHOPs.
Explore the automated acquisition of knowledge in biomedical and clinical documents using text mining and statistical techniques to identify disease-drug associations.
Biomedical literature and clinical narratives from the patient record were mined to gather knowledge about disease-drug associations. Two NLP systems, BioMedLEE and MedLEE, were applied to Medline articles and discharge summaries, respectively. Disease and drug entities were identified using the NLP systems in addition to MeSH annotations for the Medline articles. Focusing on eight diseases, co-occurrence statistics were applied to compute and evaluate the strength of association between each disease and relevant drugs.
Ranked lists of disease-drug pairs were generated and cutoffs calculated for identifying stronger associations among these pairs for further analysis. Differences and similarities between the text sources (i.e., biomedical literature and patient record) and annotations (i.e., MeSH and NLP-extracted UMLS concepts) with regards to disease-drug knowledge were observed.
This paper presents a method for acquiring disease-specific knowledge and a feasibility study of the method. The method is based on applying a combination of NLP and statistical techniques to both biomedical and clinical documents. The approach enabled extraction of knowledge about the drugs clinicians are using for patients with specific diseases based on the patient record, while it is also acquired knowledge of drugs frequently involved in controlled trials for those same diseases. In comparing the disease-drug associations, we found the results to be appropriate: the two text sources contained consistent as well as complementary knowledge, and manual review of the top five disease-drug associations by a medical expert supported their correctness across the diseases.
Technological advancements in the field of genetics have led not only to an abundance of experimental data, but also caused an exponential increase of the number of published biomolecular studies. Text mining is widely accepted as a promising technique to help researchers in the life sciences deal with the amount of available literature. This paper presents a freely available web application built on top of 21.3 million detailed biomolecular events extracted from all PubMed abstracts. These text mining results were generated by a state-of-the-art event extraction system and enriched with gene family associations and abstract generalizations, accounting for lexical variants and synonymy. The EVEX resource locates relevant literature on phosphorylation, regulation targets, binding partners, and several other biomolecular events and assigns confidence values to these events. The search function accepts official gene/protein symbols as well as common names from all species. Finally, the web application is a powerful tool for generating homology-based hypotheses as well as novel, indirect associations between genes and proteins such as coregulators.
Our research goal is to explore text mining from the metadata included in MEDLINE documents. We present MeSHmap our prototype text mining system that exploits the MeSH indexing accompanying MEDLINE records. MeSHmap supports searches via PubMed followed by user driven exploration of the MeSH terms and subheadings in the retrieved set. The potential of the system goes beyond text retrieval. It may also be used to compare entities of the same type such as pairs of drugs or pairs of procedures etc. In addition there is the potential to generate maps of entities (drugs or diseases etc.) such that the strength of the link between two entities in the map represents their similarity as expressed in the MeSH metadata of the MEDLINE documents. Higher level operators have been proposed to support these comparison and mapping functions. This paper motivates and describes MeSHmap. Future work will include user evaluations of the system.
Image content is frequently the target of biomedical information extraction systems. However, the meaning of this content cannot be easily understood without some associated text. In order to improve the integration of textual and visual information, we are developing a visual ontology for biomedical image retrieval. Our visual ontology maps the appearance of image regions to concepts in an existing textual ontology, thereby inheriting relationships among the visual entities. Such a resource creates a bridge between the visual characteristics of important image regions and their semantic interpretation. We automatically populate our visual ontology by pairing image regions with their associated descriptions. To demonstrate the usefulness of this resource, we have developed a classification method that automatically labels image regions with appropriate concepts based solely on their appearance. Our results for thoracic imaging terms show that our methods are promising first steps towards the creation of a biomedical visual ontology.
The UMLS Metathesaurus, the largest thesaurus in the biomedical domain, provides a representation of biomedical knowledge consisting of concepts classified by semantic type and both hierarchical and non-hierarchical relationships among the concepts. This knowledge has proved useful for many applications including decision support systems, management of patient records, information retrieval (IR) and data mining. Gaining effective access to the knowledge is critical to the success of these applications. This paper describes MetaMap, a program developed at the National Library of Medicine (NLM) to map biomedical text to the Metathesaurus or, equivalently, to discover Metathesaurus concepts referred to in text. MetaMap uses a knowledge intensive approach based on symbolic, natural language processing (NLP) and computational linguistic techniques. Besides being applied for both IR and data mining applications, MetaMap is one of the foundations of NLM's Indexing Initiative System which is being applied to both semi-automatic and fully automatic indexing of the biomedical literature at the library.
Motivation: Increasing rates of publication and DNA sequencing make the problem of finding relevant articles for a particular gene or genomic region more challenging than ever. Existing text-mining approaches focus on finding gene names or identifiers in English text. These are often not unique and do not identify the exact genomic location of a study.
Results: Here, we report the results of a novel text-mining approach that extracts DNA sequences from biomedical articles and automatically maps them to genomic databases. We find that ∼20% of open access articles in PubMed central (PMC) have extractable DNA sequences that can be accurately mapped to the correct gene (91%) and genome (96%). We illustrate the utility of data extracted by text2genome from more than 150 000 PMC articles for the interpretation of ChIP-seq data and the design of quantitative reverse transcriptase (RT)-PCR experiments.
Conclusion: Our approach links articles to genes and organisms without relying on gene names or identifiers. It also produces genome annotation tracks of the biomedical literature, thereby allowing researchers to use the power of modern genome browsers to access and analyze publications in the context of genomic data.
Availability and implementation: Source code is available under a BSD license from http://sourceforge.net/projects/text2genome/ and results can be browsed and downloaded at http://text2genome.org.
Supplementary information: Supplementary data are available at Bioinformatics online.
Tools to automatically summarize gene information from the literature have the potential to help genomics researchers better interpret gene expression data and investigate biological pathways. The task of finding information on sets of genes is common for genomic researchers, and PubMed is still the first choice because the most recent and original information can only be found in the unstructured, free text biomedical literature. However, finding information on a set of genes by manually searching and scanning the literature is a time-consuming and daunting task for scientists. We built and evaluated a query-based automatic summarizer of information on mouse genes studied in microarray experiments. The system clusters a set of genes by MeSH, GO and free text features and presents summaries for each gene by ranked sentences extracted from MEDLINE abstracts. Evaluation showed that the system seems to provide meaningful clusters and informative sentences are ranked higher by the algorithm.