Figures are important experimental results that are typically reported in full-text bioscience articles. Bioscience researchers need to access figures to validate research facts and to formulate or to test novel research hypotheses. On the other hand, the sheer volume of bioscience literature has made it difficult to access figures. Therefore, we are developing an intelligent figure search engine (http://figuresearch.askhermes.org). Existing research in figure search treats each figure equally, but we introduce a novel concept of “figure ranking”: figures appearing in a full-text biomedical article can be ranked by their contribution to the knowledge discovery.
We empirically validated the hypothesis of figure ranking with over 100 bioscience researchers, and then developed unsupervised natural language processing (NLP) approaches to automatically rank figures. Evaluated on a collection of 202 full-text articles in which authors have ranked the figures based on importance, our best system achieved a weighted error rate of 0.2, which is significantly better than several other baseline systems we explored. We further explored a user interfacing application in which we built novel user interfaces (UIs) incorporating figure ranking, allowing bioscience researchers to efficiently access important figures. Our evaluation results show that 92% of the bioscience researchers prefer, as their top two choices, the user interfaces in which the most important figures are enlarged. With our automatic figure ranking NLP system, bioscience researchers preferred the UIs in which the most important figures were predicted by our NLP system over the UIs in which the most important figures were randomly assigned. In addition, our results show that there was no statistically significant difference in bioscience researchers' preference between the UIs generated by automatic figure ranking and the UIs generated by human ranking annotation.
The evaluation results indicate that automatic figure ranking and user interfacing, as reported in this study, can be fully implemented in online publishing. The novel user interface integrated with the automatic figure ranking system provides a more efficient and robust way to access scientific information in the biomedical domain, which will further enhance our existing figure search engine to better facilitate accessing figures of interest for bioscientists.
In identifying appropriate strategies for effective use of preventive services for particular settings or populations, public health practitioners employ a systematic approach to evaluating the literature. Behavioral intervention studies that focus on prevention, however, pose special challenges for these traditional methods. Tools for synthesizing evidence on preventive interventions can improve public health practice. The authors developed a literature abstraction tool and a classification for preventive interventions. They incorporated the tool into a PC-based relational database and user-friendly evidence reporting system, then tested the system by reviewing behavioral interventions for hypertension management. They performed a structured literature search and reviewed 100 studies on behavioral interventions for hypertension management. They abstracted information using the abstraction tool and classified important elements of interventions for comparison across studies. The authors found that many studies in their pilot project did not report sufficient information to allow for complete evaluation, comparison across studies, or replication of the intervention. They propose that studies reporting on preventive interventions should (a) categorize interventions into discrete components; (b) report sufficient participant information; and (c) report characteristics such as intervention leaders, timing, and setting so that public health professionals can compare and select the most appropriate interventions.
Reactome (http://www.reactome.org) is an expert-authored, peer-reviewed knowledgebase of human reactions and pathways that functions as a data mining resource and electronic textbook. Its current release includes 2975 human proteins, 2907 reactions and 4455 literature citations. A new entity-level pathway viewer and improved search and data mining tools facilitate searching and visualizing pathway data and the analysis of user-supplied high-throughput data sets. Reactome has increased its utility to the model organism communities with improved orthology prediction methods allowing pathway inference for 22 species and through collaborations to create manually curated Reactome pathway datasets for species including Arabidopsis, Oryza sativa (rice), Drosophila and Gallus gallus (chicken). Reactome's data content and software can all be freely used and redistributed under open source terms.
The biomedical literature grows at a tremendous rate, and PubMed already comprises over 15 000 000 abstracts. Finding relevant literature is an important and difficult problem. We introduce GoPubMed, a web server which allows users to explore PubMed search results with the Gene Ontology (GO), a hierarchically structured vocabulary for molecular biology. GoPubMed provides the following benefits: first, it gives an overview of the literature abstracts by categorizing abstracts according to the GO, thus allowing users to quickly navigate through the abstracts by category. Second, it automatically shows general ontology terms related to the original query, which often do not even appear directly in the abstract. Third, it enables users to verify its classification because GO terms are highlighted in the abstracts and each term is labelled with an accuracy percentage. Fourth, exploring PubMed abstracts with GoPubMed is useful as it shows definitions of GO terms without the need for further look up. GoPubMed is online at . Querying is currently limited to 100 papers per query.
The biomedical literature is represented by millions of abstracts available in the Medline database. These abstracts can be queried with the PubMed interface, which provides a keyword-based Boolean search engine. This approach shows limitations in the retrieval of abstracts related to very specific topics, as it is difficult for a non-expert user to find all of the most relevant keywords related to a biomedical topic. Additionally, when searching for more general topics, the same approach may return hundreds of unranked references. To address these issues, text mining tools have been developed to help scientists focus on relevant abstracts. We have implemented the MedlineRanker webserver, which allows a flexible ranking of Medline for a topic of interest without expert knowledge. Given some abstracts related to a topic, the program deduces automatically the most discriminative words in comparison to a random selection. These words are used to score other abstracts, including those from not yet annotated recent publications, which can be then ranked by relevance. We show that our tool can be highly accurate and that it is able to process millions of abstracts in a practical amount of time. MedlineRanker is free for use and is available at http://cbdm.mdc-berlin.de/tools/medlineranker.
BRENDA (BRaunschweig ENzyme DAtabase, http://www.brenda-enzymes.org) is a major resource for enzyme related information. First and foremost, it provides data which are manually curated from the primary literature. DRENDA (Disease RElated ENzyme information DAtabase) complements BRENDA with a focus on the automatic search and categorization of enzyme and disease related information from title and abstracts of primary publications. In a two-step procedure DRENDA makes use of text mining and machine learning methods.
Currently enzyme and disease related references are biannually updated as part of the standard BRENDA update. 910,897 relations of EC-numbers and diseases were extracted from titles or abstracts and are included in the second release in 2010. The enzyme and disease entity recognition has been successfully enhanced by a further relation classification via machine learning. The classification step has been evaluated by a 5-fold cross validation and achieves an F1 score between 0.802 ± 0.032 and 0.738 ± 0.033 depending on the categories and pre-processing procedures. In the final DRENDA content every category reaches a classification specificity of at least 96.7% and a precision that ranges from 86-98% at the highest confidence level to 64-83% at the lowest confidence level, which is associated with higher recall.
The DRENDA processing chain analyses PubMed, locates references with disease-related information on enzymes and categorises their focus according to the categories causal interaction, therapeutic application, diagnostic usage and ongoing research. The categorisation gives an impression of the focus of the located references. Thus, the relation categorisation can facilitate orientation within the rapidly growing number of references with impact on diseases and enzymes. The DRENDA information is available as additional information in BRENDA.
Keeping current with drug therapy information is challenging for health care practitioners. Technologies are often implemented to facilitate access to current and credible drug information sources. In the Canadian province of Nova Scotia, legislation was passed in 2002 to allow nurse practitioners (NPs) to practice collaboratively with physician partners. The purpose of this study was to determine the current utilization patterns of information technologies by these groups of practitioners.
Nurse practitioners and their collaborating physician partners in Nova Scotia were sent a survey in February 2005 to determine the frequency of use, usefulness, accessibility, credibility, and current/timeliness of personal digital assistant (PDA), computer, and print drug information resources. Two surveys were developed (one for PDA users and one for computer users) and revised based on a literature search, stakeholder consultation, and pilot-testing results. A second distribution to nonresponders occurred two weeks following the first. Data were entered and analysed with SPSS.
Twenty-seven (14 NPs and 13 physicians) of 36 (75%) recipients responded. 22% (6) returned personal digital assistant (PDA) surveys. Respondents reported print, health professionals, and online/electronic resources as the most to least preferred means to access drug information, respectively. 37% and 35% of respondents reported using "both print and electronic but print more than electronic" and "print only", respectively, to search monograph-related drug information queries whereas 4% reported using "PDA only". Analysis of respondent ratings for all resources in the categories print, health professionals and other, and online/electronic resources, indicated that the Compendium of Pharmaceuticals and Specialties and pharmacists ranked highly for frequency of use, usefulness, accessibility, credibility, and current/timeliness by both groups of practitioners. Respondents' preferences and resource ratings were consistent with self-reported methods for conducting drug information queries. Few differences existed between NP and physician rankings of resources.
The use of computers and PDAs remains limited, which is also consistent with preferred and frequent use of print resources. Education for these practitioners regarding available electronic drug information resources may facilitate future computer and PDA use. Further research is needed to determine methods to increase computer and PDA use and whether these technologies affect prescribing and patient outcomes.
PathBLAST is a network alignment and search tool for comparing protein interaction networks across species to identify protein pathways and complexes that have been conserved by evolution. The basic method searches for high-scoring alignments between pairs of protein interaction paths, for which proteins of the first path are paired with putative orthologs occurring in the same order in the second path. This technique discriminates between true- and false-positive interactions and allows for functional annotation of protein interaction pathways based on similarity to the network of another, well-characterized species. PathBLAST is now available at http://www.pathblast.org/ as a web-based query. In this implementation, the user specifies a short protein interaction path for query against a target protein–protein interaction network selected from a network database. PathBLAST returns a ranked list of matching paths from the target network along with a graphical view of these paths and the overlap among them. Target protein–protein interaction networks are currently available for Helicobacter pylori, Saccharomyces cerevisiae, Caenorhabditis elegans and Drosophila melanogaster. Just as BLAST enables rapid comparison of protein sequences between genomes, tools such as PathBLAST are enabling comparative genomics at the network level.
With the overwhelming volume of genomic and molecular information now available in many databases, researchers need from bioinformaticians more than encouragement to refine their searches. We present here GeneRanker, an online system that allows researchers to obtain a ranked list of genes potentially related to a specific disease or biological process by combining gene-disease (or gene-biological process) associations with protein-protein interactions extracted from the literature, using computational analysis of the protein network topology to more accurately rank the predicted associations. GeneRanker was evaluated in the context of brain cancer research, and is freely available online at http://www.generanker.org.
The MEDLINE database contains over 12 million references to scientific literature, with about 3/4 of recent articles including an abstract of the publication. Retrieval of entries using queries with keywords is useful for human users that need to obtain small selections. However, particular analyses of the literature or database developments may need the complete ranking of all the references in the MEDLINE database as to their relevance to a topic of interest. This report describes a method that does this ranking using the differences in word content between MEDLINE entries related to a topic and the whole of MEDLINE, in a computational time appropriate for an article search query engine.
We tested the capabilities of our system to retrieve MEDLINE references which are relevant to the subject of stem cells. We took advantage of the existing annotation of references with terms from the MeSH hierarchical vocabulary (Medical Subject Headings, developed at the National Library of Medicine). A training set of 81,416 references was constructed by selecting entries annotated with the MeSH term stem cells or some child in its sub tree. Frequencies of all nouns, verbs, and adjectives in the training set were computed and the ratios of word frequencies in the training set to those in the entire MEDLINE were used to score references. Self-consistency of the algorithm, benchmarked with a test set containing the training set and an equal number of references randomly selected from MEDLINE was better using nouns (79%) than adjectives (73%) or verbs (70%). The evaluation of the system with 6,923 references not used for training, containing 204 articles relevant to stem cells according to a human expert, indicated a recall of 65% for a precision of 65%.
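The frequency-ratio scoring described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say how per-word ratios are combined, so summing log ratios is an assumption, as are the whitespace tokenizer (no part-of-speech filtering), the `floor` smoothing parameter, and the function names.

```python
import math
from collections import Counter


def word_frequencies(docs):
    """Return, for each word, the fraction of documents that contain it."""
    counts = Counter()
    for doc in docs:
        # Count each word once per document (document frequency).
        counts.update(set(doc.lower().split()))
    return {word: n / len(docs) for word, n in counts.items()}


def score(doc, train_freq, background_freq, floor=1e-6):
    """Score a document by summing log(training freq / background freq).

    Assumption: log ratios are summed over the distinct words of the
    document that occur in the training set; `floor` avoids division
    by zero for words missing from the background collection.
    """
    total = 0.0
    for word in set(doc.lower().split()):
        if word in train_freq:
            background = max(background_freq.get(word, 0.0), floor)
            total += math.log(train_freq[word] / background)
    return total


# Toy data standing in for the topic training set and the whole collection.
training = ["stem cells differentiate", "stem cells renew"]
collection = training + ["protein kinase activity", "kinase signalling cells"]

train_freq = word_frequencies(training)
background_freq = word_frequencies(collection)

candidates = ["stem cells niche", "protein kinase"]
ranked = sorted(candidates,
                key=lambda d: score(d, train_freq, background_freq),
                reverse=True)
```

Words that are over-represented in the training set relative to the whole collection contribute positive terms, so references sharing that vocabulary rise to the top of `ranked`.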
This strategy appears to be useful for predicting the relevance of MEDLINE references to a given concept. The method is simple and can be used with any user-defined training set. Choice of the part of speech of the words used for classification has important effects on performance. Lists of words, scripts, and additional information are available from the web address .
PubAtlas (www.pubatlas.org) is a web service and standalone program providing literature maps for the biomedical research literature. It accepts user-defined sets of terms (PubMed queries) as input, and permits ‘BLASTing’ of one set against another: for all terms x and y in these sets, deriving the results of the pairwise intersections x AND y. This all vs. all capability extends PubMed with a literature analysis interface. Correspondingly, the basic form of literature map that PubAtlas provides for exploring associations among sets of terms is an interactive tabular display, in heatmap/microarray format.
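The all vs. all pairing of two term sets can be illustrated with a short sketch. It only builds the pairwise `x AND y` query strings and arranges counts in a table; the function names and the `count_fn` callback (which would return the number of hits for a query) are hypothetical and do not reflect the PubAtlas code or any PubMed API.

```python
from itertools import product


def all_vs_all(set_a, set_b):
    """Map every (x, y) pair to the Boolean query 'x AND y'."""
    return {(x, y): f"({x}) AND ({y})" for x, y in product(set_a, set_b)}


def count_table(set_a, set_b, count_fn):
    """Build a row-per-x, column-per-y table of hit counts.

    `count_fn` is a caller-supplied function (hypothetical here) that
    takes a query string and returns an integer hit count.
    """
    queries = all_vs_all(set_a, set_b)
    return [[count_fn(queries[(x, y)]) for y in set_b] for x in set_a]


genes = ["p53", "BRCA1"]
processes = ["apoptosis", "DNA repair"]
queries = all_vs_all(genes, processes)

# Stand-in counter for demonstration only: uses the query length.
table = count_table(genes, processes, count_fn=len)
```

Each cell of `table` corresponds to one pairwise intersection, which is the shape a heatmap-style display needs.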
PubAtlas supports development of specialized lexica -- hierarchies of controlled terminology that can represent sets of related concepts or a ‘user-defined query language’. PubAtlas also provides historical perspectives on the literature, with temporal query features that highlight historical patterns. Generally, it is a framework for extending the PubMed interface, and an extensible platform for producing interactive literature maps.
InterWeaver is a web server for discovering potential protein interactions with online evidence automatically extracted from protein interaction databases, literature abstracts, domain fusion events and domain interactions. Given a new protein sequence, the server identifies potential interaction partners using two approaches. In the homology-based approach, the system performs sequence homology searches to find similar proteins in other species, and then searches the protein interaction databases and the biomedical literature for interaction partners. In the domain-based approach, the system detects the domains in the input protein sequence and searches databases of domain fusion events and putative domain interactions to suggest potential interacting partners. The results are compiled into a personalized and downloadable interaction report to aid biologists in their discovery of protein interactions. InterWeaver is freely available for academic users at http://interweaver.i2r.a-star.edu.sg/.
Since the inception of the GO annotation project, a variety of tools have been developed that support exploring and searching the GO database. In particular, a variety of tools that perform GO enrichment analysis are currently available. Most of these tools require as input a target set of genes and a background set and seek enrichment in the target set compared to the background set. A few tools also exist that support analyzing ranked lists. The latter typically rely on simulations or on union-bound correction for assigning statistical significance to the results.
GOrilla is a web-based application that identifies enriched GO terms in ranked lists of genes, without requiring the user to provide explicit target and background sets. This is particularly useful in many typical cases where genomic data may be naturally represented as a ranked list of genes (e.g. by level of expression or of differential expression). GOrilla employs a flexible threshold statistical approach to discover GO terms that are significantly enriched at the top of a ranked gene list. Building on a complete theoretical characterization of the underlying distribution, called mHG, GOrilla computes an exact p-value for the observed enrichment, taking threshold multiple testing into account without the need for simulations. This enables rigorous statistical analysis of thousands of genes and thousands of GO terms in the order of seconds. The output of the enrichment analysis is visualized as a hierarchical structure, providing a clear view of the relations between enriched GO terms.
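The flexible-threshold idea can be sketched as follows, assuming the usual definition of the minimum hypergeometric (mHG) statistic: the smallest hypergeometric tail probability over all cutoffs of the ranked list. This sketch computes only the raw statistic. It does not compute the exact, multiple-testing-corrected p-value mentioned above, and the function names are illustrative rather than taken from GOrilla.

```python
from math import comb


def hg_tail(N, B, n, b):
    """P(X >= b) for X ~ Hypergeometric(N items, B annotated, n drawn)."""
    total = comb(N, B)
    hits = sum(comb(n, i) * comb(N - n, B - i)
               for i in range(b, min(n, B) + 1))
    return hits / total


def mhg_score(labels):
    """Return (minimum tail probability, cutoff n at which it occurs).

    `labels` is the ranked gene list as 0/1 flags, where 1 means the
    gene is annotated with the GO term under test.
    """
    N, B = len(labels), sum(labels)
    best, best_n = 1.0, 0
    annotated_so_far = 0
    for n, label in enumerate(labels, start=1):
        annotated_so_far += label
        p = hg_tail(N, B, n, annotated_so_far)
        if p < best:
            best, best_n = p, n
    return best, best_n


# Both annotated genes sit at the top of a four-gene ranking.
top_score, top_cutoff = mhg_score([1, 1, 0, 0])
# Both annotated genes sit at the bottom: no enrichment at the top.
bottom_score, bottom_cutoff = mhg_score([0, 0, 1, 1])
```

For the first toy list the minimum is reached at cutoff 2 with a tail probability of 1/6. Because the minimum is taken over many cutoffs, this raw value is not itself a p-value.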
GOrilla is an efficient GO analysis tool with unique features that make a useful addition to the existing repertoire of GO enrichment tools. GOrilla's unique features and advantages over other threshold free enrichment tools include rigorous statistics, fast running time and an effective graphical representation. GOrilla is publicly available at:
Summary: MiSearch is an adaptive biomedical literature search tool that ranks citations based on a statistical model for the likelihood that a user will choose to view them. Citation selections are automatically acquired during browsing and used to dynamically update a likelihood model that includes authorship, journal and PubMed indexing information. The user can optionally elect to include or exclude specific features and vary the importance of timeliness in the ranking.
Supplementary information: Supplementary data are available at Bioinformatics online.
P3DB (http://www.p3db.org/) provides a resource of protein phosphorylation data from multiple plants. The database was initially constructed with a dataset from oilseed rape, including 14 670 nonredundant phosphorylation sites from 6382 substrate proteins, representing the largest collection of plant phosphorylation data to date. Additional protein phosphorylation data are being deposited into this database from large-scale studies of Arabidopsis thaliana and soybean. Phosphorylation data from current literature are also being integrated into the P3DB. With a web-based user interface, the database is browsable, downloadable and searchable by protein accession number, description and sequence. A BLAST utility was integrated and a phosphopeptide BLAST browser was implemented to allow users to query the database for phosphopeptides similar to protein sequences of their interest. With the large-scale phosphorylation data and associated web-based tools, P3DB will be a valuable resource for both plant and nonplant biologists in the field of protein phosphorylation.
The Arabidopsis Information Resource (TAIR, http://arabidopsis.org) is the model organism database for the fully sequenced and intensively studied model plant Arabidopsis thaliana. Data in TAIR are derived in large part from manual curation of the Arabidopsis research literature and direct submissions from the research community. New developments at TAIR include the addition of the GBrowse genome viewer to the TAIR site, a redesigned home page, navigation structure and portal pages to make the site more intuitive and easier to use, the launch of several TAIR web services and a new genome annotation release (TAIR7) in April 2007. A combination of manual and computational methods was used to generate this release, which contains 27 029 protein-coding genes, 3889 pseudogenes or transposable elements and 1123 ncRNAs (32 041 genes in all, 37 019 gene models). A total of 681 new genes and 1002 new splice variants were added. Overall, 10 098 loci (one-third of all loci from the previous TAIR6 release) were updated for the TAIR7 release.
Predicting RNA secondary structures is a very important task, and continues to be a challenging problem, even though several methods and algorithms have been proposed in the literature. In this article, we propose an algorithm called Tfold, for predicting non-coding RNA secondary structures. Tfold takes as input an RNA sequence for which the secondary structure is searched and a set of aligned homologous sequences. It combines criteria of stability, conservation and covariation in order to search for stems and pseudoknots (whatever their type). Stems are searched recursively, from the most to the least stable. Tfold uses an algorithm called SSCA for selecting the most appropriate sequences from a large set of homologous sequences (taken from a database for example) to use for the prediction. Tfold can take into account one or several stems considered by the user as belonging to the secondary structure. Tfold can return several structures (if requested by the user) when ‘rival’ stems are found. Tfold has a complexity of O(n²), with n the sequence length. The developed software, which offers several different uses, is available on the web site: http://tfold.ibisc.univ-evry.fr/TFold.
WormBase (http://www.wormbase.org/) is a web-accessible central data repository for information about Caenorhabditis elegans and related nematodes. The past two years have seen a significant expansion in the biological scope of WormBase, including the integration of large-scale, genome-wide data sets, the inclusion of genome sequence and gene predictions from related species and active literature curation. This expansion of data has also driven the development and refinement of user interfaces and operability, including a new Genome Browser, new searches and facilities for data access and the inclusion of extensive documentation. These advances have expanded WormBase beyond the obvious target audience of C. elegans researchers, to include researchers wishing to explore problems in functional and comparative genomics within the context of a powerful genetic system.
Brownian dynamics (BD) in a suitably constructed potential of mean force is an efficient and accurate method for simulating ion transport through wide ion channels. Here, a web-based graphical user interface (GUI) is presented for grand canonical Monte Carlo (GCMC) BD simulations of channel proteins: http://www.charmm-gui.org/input/gcmcbd. The webserver is designed to help users avoid most of the technical difficulties and issues encountered in setting up and simulating complex pore systems. GCMC/BD simulation results for three proteins, the voltage dependent anion channel (VDAC), α-Hemolysin, and the protective antigen pore of the anthrax toxin (PA), are presented to illustrate system setup, input preparation, and typical output (conductance, ion density profile, ion selectivity, and ion asymmetry). Two models for the input diffusion constants for potassium and chloride ions in the pore are compared: scaling of the bulk diffusion constants by 0.5, as deduced from previous all-atom molecular dynamics simulations of VDAC; and a hydrodynamics based model (HD) of diffusion through a tube. The HD model yields excellent agreement with experimental conductances for VDAC and α-Hemolysin, while scaling bulk diffusion constants by 0.5 leads to underestimates of 10–20%. For PA, simulated ion conduction values overestimate experimental values by a factor of 1.5 to 7 (depending on His protonation state and the transmembrane potential), implying that the currently available computational model of this protein requires further structural refinement.
GCMC/BD; Channel conductance; Ion selectivity; VDAC; α-Hemolysin; anthrax toxin protective antigen pore
Translational Science Search (http://tscience.nlm.nih.gov) is a Web application for finding MEDLINE/PubMed journal articles that are regarded by their authors as novel or promising, or as having potential clinical application. A set of “translational” filters and related terms was created by reviewing journal articles published in clinical and translational science journals. Through E-Utilities, a user’s query and translational science (TS) filters are submitted to PubMed, and then the retrieved PubMed citations are matched with a database of MeSH terms (for disease conditions) and RxNorm (for interventions) to locate the search term, translational filters found, and associated interventions in the title and abstract. An algorithm ranks the Interventions and Conditions, and then highlights them in the results page for quick reading and evaluation. Using previously searched terms and standard formulas, the Precision and Recall of Translational Science Search (TSS) were 0.99 and 0.47, compared to 0.58 and 1.0 for PubMed Entrez, respectively.
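The "standard formulas" for precision and recall can be written out as a small sketch. The citation identifiers below are made up for illustration and do not reproduce the reported evaluation.

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall for one query.

    precision = |retrieved ∩ relevant| / |retrieved|
    recall    = |retrieved ∩ relevant| / |relevant|
    """
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical citation identifiers, for illustration only.
retrieved_ids = [101, 102, 103, 104]
relevant_ids = [102, 104, 105, 106, 107, 108]
precision, recall = precision_recall(retrieved_ids, relevant_ids)
```

A filtered search that returns few but mostly relevant citations scores high on precision and lower on recall, while an unfiltered search that returns everything relevant along with much else shows the opposite pattern, which matches the trade-off reported above.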
ccPDB (http://crdd.osdd.net/raghava/ccpdb/) is a database of data sets compiled from the literature and Protein Data Bank (PDB). First, we collected and compiled data sets from the literature used for developing bioinformatics methods to annotate the structure and function of proteins. Second, data sets were derived from the latest release of PDB using standard protocols. Third, we developed a powerful module for creating a wide range of customized data sets from the current release of PDB. This is a flexible module that allows users to create data sets using a simple six step procedure. In addition, a number of web services have been integrated in ccPDB, which include submission of jobs on PDB-based servers, annotation of protein structures and generation of patterns. This database maintains >30 types of data sets such as secondary structure, tight-turns, nucleotide interacting residues, metals interacting residues, DNA/RNA binding residues and so on.
EMAGE (http://www.emouseatlas.org/emage) is a freely available online database of in situ gene expression patterns in the developing mouse embryo. Gene expression domains from raw images are extracted and integrated spatially into a set of standard 3D virtual mouse embryos at different stages of development, which allows data interrogation by spatial methods. An anatomy ontology is also used to describe sites of expression, which allows data to be queried using text-based methods. Here, we describe recent enhancements to EMAGE including: the release of a completely re-designed website, which offers integration of many different search functions in HTML web pages, improved user feedback and the ability to find similar expression patterns at the click of a button; back-end refactoring from an object oriented to relational architecture, allowing associated SQL access; and the provision of further access by standard formatted URLs and a Java API. We have also increased data coverage by sourcing from a greater selection of journals and developed automated methods for spatial data annotation that are being applied to spatially incorporate the genome-wide (∼19 000 gene) ‘EURExpress’ dataset into EMAGE.
Evaluating the quality of the biomedical literature and of health-related websites is a challenging information retrieval task. Current commonly used methods include impact factor for journals, PubMed’s clinical query filters and machine learning-based filter models for articles, and PageRank for websites. Previous work has focused on the average performance of these methods without considering the topic, and it is unknown how performance varies for specific topics or focused searches. Clinicians, researchers, and users should be aware when expected performance is not achieved for specific topics. The present work analyzes the behavior of these methods for a variety of topics. Impact factor, clinical query filters, and PageRank vary widely across different topics while a topic-specific impact factor and machine learning-based filter models are more stable. The results demonstrate that a method may perform excellently on average but struggle when used on a number of narrower topics. Topic-adjusted metrics and other topic-robust methods have an advantage in such situations. Users of traditional topic-sensitive metrics should be aware of their limitations.
information retrieval; machine learning; PageRank; Journal Impact Factor; topic-sensitivity; bibliometrics
We introduce GenRev, a network-based software package developed to explore the functional relevance of genes generated as an intermediate result from numerous high-throughput technologies. GenRev searches for optimal intermediate nodes (genes) for the connection of input nodes via several algorithms, including the Klein-Ravi algorithm, the limited kWalks algorithm and a heuristic local search algorithm. Gene ranking and graph clustering analyses are integrated into the package. GenRev has the following features. (1) It provides users with great flexibility to define their own networks. (2) Users are allowed to define each gene’s importance in a subnetwork search by setting its score. (3) It is standalone and platform independent. (4) It provides an optimization in subnetwork search, which dramatically reduces the running time. GenRev is particularly designed for general use so that users have the flexibility to choose a reference network and define the score of genes. GenRev is freely available at http://bioinfo.mc.vanderbilt.edu/GenRev.html.
Gene ranking; Network; Subnetwork; Klein-Ravi algorithm; limited kWalks algorithm; Disease genes
High-throughput phenotyping technologies in combination with genetic variability for the plant model species Arabidopsis thaliana (Arabidopsis) offer an excellent experimental platform to reveal the effects of different gene combinations on phenotypes. These developments have been coupled with computational approaches to extract information not only from the multidimensional data, capturing various levels of biochemical organization, but also from various morphological and growth-related traits. Nevertheless, the existing methods usually focus on data aggregation which may neglect accession-specific effects. Here we argue that revealing the molecular mechanisms governing a desired set of output traits can be performed by ranking of accessions based on their efficiencies relative to all other analyzed accessions. To this end, we propose a framework for evaluating accessions via their relative efficiencies which establish a relationship between multidimensional system’s inputs and outputs from different environmental conditions. The framework combines data envelopment analysis (DEA) with a novel valency index characterizing the difference in congruence between the efficiency rankings of accessions under various conditions. We illustrate the advantages of the proposed approach for analyzing genetic variability on a publicly available data set comprising quantitative data on metabolic and morphological traits for 23 Arabidopsis accessions under three conditions of nitrogen availability. In addition, we extend the proposed framework to identify the set of traits displaying the highest influence on ranking based on the relative efficiencies of the considered accessions. As an outlook, we discuss how the proposed framework can be combined with well-established statistical techniques to further dissect the relationship between natural variability and metabolism.
data envelopment analysis; multivariate data analysis; genotypes; metabolomics; efficiency