Search tips
Search criteria

Results 1-25 (777425)

Clipboard (0)

Related Articles

1.  Concept-based query expansion for retrieving gene related publications from MEDLINE 
BMC Bioinformatics  2010;11:212.
Advances in biotechnology and in high-throughput methods for gene analysis have contributed to an exponential increase in the number of scientific publications in these fields of study. While much of the data and results described in these articles are entered and annotated in the various existing biomedical databases, the scientific literature is still the major source of information. There is, therefore, a growing need for text mining and information retrieval tools to help researchers find the relevant articles for their study. To tackle this, several tools have been proposed to provide alternative solutions for specific user requests.
This paper presents QuExT, a new PubMed-based document retrieval and prioritization tool that, from a given list of genes, searches for the most relevant results from the literature. QuExT follows a concept-oriented query expansion methodology to find documents containing concepts related to the genes in the user input, such as protein and pathway names. The retrieved documents are ranked according to user-definable weights assigned to each concept class. By changing these weights, users can modify the ranking of the results in order to focus on documents dealing with a specific concept. The method's performance was evaluated using data from the 2004 TREC genomics track, producing a mean average precision of 0.425, with an average of 4.8 and 31.3 relevant documents within the top 10 and 100 retrieved abstracts, respectively.
QuExT implements a concept-based query expansion scheme that leverages gene-related information available on a variety of biological resources. The main advantage of the system is to give the user control over the ranking of the results by means of a simple weighting scheme. Using this approach, researchers can effortlessly explore the literature regarding a group of genes and focus on the different aspects relating to these genes.
PMCID: PMC2873540  PMID: 20426836
2.  Doublet method for very fast autocoding 
Autocoding (or automatic concept indexing) occurs when a software program extracts terms contained within text and maps them to a standard list of concepts contained in a nomenclature. The purpose of autocoding is to provide a way of organizing large documents by the concepts represented in the text. Because textual data accumulates rapidly in biomedical institutions, the computational methods used to autocode text must be very fast. The purpose of this paper is to describe the doublet method, a new algorithm for very fast autocoding.
An autocoder was written that transforms plain-text into intercalated word doublets (e.g. "The ciliary body produces aqueous humor" becomes "The ciliary, ciliary body, body produces, produces aqueous, aqueous humor"). Each doublet is checked against an index of doublets extracted from a standard nomenclature. Matching doublets are assigned a numeric code specific for each doublet found in the nomenclature. Text doublets that do not match the index of doublets extracted from the nomenclature are not part of valid nomenclature terms. Runs of matching doublets from text are concatenated and matched against nomenclature terms (also represented as runs of doublets).
The doublet autocoder was compared for speed and performance against a previously published phrase autocoder. Both autocoders are Perl scripts, and both autocoders used an identical text (a 170+ Megabyte collection of abstracts collected through a PubMed search) and the same nomenclature (neocl.xml, containing over 102,271 unique names of neoplasms). In side-by-side comparison on the same computer, the doublet method autocoder was 8.4 times faster than the phrase autocoder (211 seconds versus 1,776 seconds). The doublet method codes 0.8 Megabytes of text per second on a desktop computer with a 1.6 GHz processor. In addition, the doublet autocoder successfully matched terms that were missed by the phrase autocoder, while the phrase autocoder found no terms that were missed by the doublet autocoder.
The doublet method of autocoding is a novel algorithm for rapid text autocoding. The method will work with any nomenclature and will parse any ascii plain-text. An implementation of the algorithm in Perl is provided with this article. The algorithm, the Perl implementation, the neoplasm nomenclature, and Perl itself, are all open source materials.
PMCID: PMC521082  PMID: 15369595
3.  Dynamic programming re-ranking for PPI interactor and pair extraction in full-text articles 
BMC Bioinformatics  2011;12:60.
Experimentally verified protein-protein interactions (PPIs) cannot be easily retrieved by researchers unless they are stored in PPI databases. The curation of such databases can be facilitated by employing text-mining systems to identify genes which play the interactor role in PPIs and to map these genes to unique database identifiers (interactor normalization task or INT) and then to return a list of interaction pairs for each article (interaction pair task or IPT). These two tasks are evaluated in terms of the area under curve of the interpolated precision/recall (AUC iP/R) score because the order of identifiers in the output list is important for ease of curation.
Our INT system developed for the BioCreAtIvE II.5 INT challenge achieved a promising AUC iP/R of 43.5% by using a support vector machine (SVM)-based ranking procedure. Using our new re-ranking algorithm, we have been able to improve system performance (AUC iP/R) by 1.84%. Our experimental results also show that with the re-ranked INT results, our unsupervised IPT system can achieve a competitive AUC iP/R of 23.86%, which outperforms the best BC II.5 INT system by 1.64%. Compared to using only SVM ranked INT results, using re-ranked INT results boosts AUC iP/R by 7.84%. Statistical significance t-test results show that our INT/IPT system with re-ranking outperforms that without re-ranking by a statistically significant difference.
In this paper, we present a new re-ranking algorithm that considers co-occurrence among identifiers in an article to improve INT and IPT ranking results. Combining the re-ranked INT results with an unsupervised approach to find associations among interactors, the proposed method can boost the IPT performance. We also implement score computation using dynamic programming, which is faster and more efficient than traditional approaches.
PMCID: PMC3053256  PMID: 21342534
4.  Latent Semantic Indexing of PubMed abstracts for identification of transcription factor candidates from microarray derived gene sets 
BMC Bioinformatics  2011;12(Suppl 10):S19.
Identification of transcription factors (TFs) responsible for modulation of differentially expressed genes is a key step in deducing gene regulatory pathways. Most current methods identify TFs by searching for presence of DNA binding motifs in the promoter regions of co-regulated genes. However, this strategy may not always be useful as presence of a motif does not necessarily imply a regulatory role. Conversely, motif presence may not be required for a TF to regulate a set of genes. Therefore, it is imperative to include functional (biochemical and molecular) associations, such as those found in the biomedical literature, into algorithms for identification of putative regulatory TFs that might be explicitly or implicitly linked to the genes under investigation.
In this study, we present a Latent Semantic Indexing (LSI) based text mining approach for identification and ranking of putative regulatory TFs from microarray derived differentially expressed genes (DEGs). Two LSI models were built using different term weighting schemes to devise pair-wise similarities between 21,027 mouse genes annotated in the Entrez Gene repository. Amongst these genes, 433 were designated TFs in the TRANSFAC database. The LSI derived TF-to-gene similarities were used to calculate TF literature enrichment p-values and rank the TFs for a given set of genes. We evaluated our approach using five different publicly available microarray datasets focusing on TFs Rel, Stat6, Ddit3, Stat5 and Nfic. In addition, for each of the datasets, we constructed gold standard TFs known to be functionally relevant to the study in question. Receiver Operating Characteristics (ROC) curves showed that the log-entropy LSI model outperformed the tf-normal LSI model and a benchmark co-occurrence based method for four out of five datasets, as well as motif searching approaches, in identifying putative TFs.
Our results suggest that our LSI based text mining approach can complement existing approaches used in systems biology research to decipher gene regulatory networks by providing putative lists of ranked TFs that might be explicitly or implicitly associated with sets of DEGs derived from microarray experiments. In addition, unlike motif searching approaches, LSI based approaches can reveal TFs that may indirectly regulate genes.
PMCID: PMC3236841  PMID: 22165960
5.  Robust rank aggregation for gene list integration and meta-analysis 
Bioinformatics  2012;28(4):573-580.
Motivation: The continued progress in developing technological platforms, availability of many published experimental datasets, as well as different statistical methods to analyze those data have allowed approaching the same research question using various methods simultaneously. To get the best out of all these alternatives, we need to integrate their results in an unbiased manner. Prioritized gene lists are a common result presentation method in genomic data analysis applications. Thus, the rank aggregation methods can become a useful and general solution for the integration task.
Results: Standard rank aggregation methods are often ill-suited for biological settings where the gene lists are inherently noisy. As a remedy, we propose a novel robust rank aggregation (RRA) method. Our method detects genes that are ranked consistently better than expected under null hypothesis of uncorrelated inputs and assigns a significance score for each gene. The underlying probabilistic model makes the algorithm parameter free and robust to outliers, noise and errors. Significance scores also provide a rigorous way to keep only the statistically relevant genes in the final list. These properties make our approach robust and compelling for many settings.
Availability: All the methods are implemented as a GNU R package RobustRankAggreg, freely available at the Comprehensive R Archive Network
Supplementary information Supplementary data are available at Bioinformatics online.
PMCID: PMC3278763  PMID: 22247279
6.  The Protein-Protein Interaction tasks of BioCreative III: classification/ranking of articles and linking bio-ontology concepts to full text 
BMC Bioinformatics  2011;12(Suppl 8):S3.
Determining usefulness of biomedical text mining systems requires realistic task definition and data selection criteria without artificial constraints, measuring performance aspects that go beyond traditional metrics. The BioCreative III Protein-Protein Interaction (PPI) tasks were motivated by such considerations, trying to address aspects including how the end user would oversee the generated output, for instance by providing ranked results, textual evidence for human interpretation or measuring time savings by using automated systems. Detecting articles describing complex biological events like PPIs was addressed in the Article Classification Task (ACT), where participants were asked to implement tools for detecting PPI-describing abstracts. Therefore the BCIII-ACT corpus was provided, which includes a training, development and test set of over 12,000 PPI relevant and non-relevant PubMed abstracts labeled manually by domain experts and recording also the human classification times. The Interaction Method Task (IMT) went beyond abstracts and required mining for associations between more than 3,500 full text articles and interaction detection method ontology concepts that had been applied to detect the PPIs reported in them.
A total of 11 teams participated in at least one of the two PPI tasks (10 in ACT and 8 in the IMT) and a total of 62 persons were involved either as participants or in preparing data sets/evaluating these tasks. Per task, each team was allowed to submit five runs offline and another five online via the BioCreative Meta-Server. From the 52 runs submitted for the ACT, the highest Matthew's Correlation Coefficient (MCC) score measured was 0.55 at an accuracy of 89% and the best AUC iP/R was 68%. Most ACT teams explored machine learning methods, some of them also used lexical resources like MeSH terms, PSI-MI concepts or particular lists of verbs and nouns, some integrated NER approaches. For the IMT, a total of 42 runs were evaluated by comparing systems against manually generated annotations done by curators from the BioGRID and MINT databases. The highest AUC iP/R achieved by any run was 53%, the best MCC score 0.55. In case of competitive systems with an acceptable recall (above 35%) the macro-averaged precision ranged between 50% and 80%, with a maximum F-Score of 55%.
The results of the ACT task of BioCreative III indicate that classification of large unbalanced article collections reflecting the real class imbalance is still challenging. Nevertheless, text-mining tools that report ranked lists of relevant articles for manual selection can potentially reduce the time needed to identify half of the relevant articles to less than 1/4 of the time when compared to unranked results. Detecting associations between full text articles and interaction detection method PSI-MI terms (IMT) is more difficult than might be anticipated. This is due to the variability of method term mentions, errors resulting from pre-processing of articles provided as PDF files, and the heterogeneity and different granularity of method term concepts encountered in the ontology. However, combining the sophisticated techniques developed by the participants with supporting evidence strings derived from the articles for human interpretation could result in practical modules for biological annotation workflows.
PMCID: PMC3269938  PMID: 22151929
7.  Statistical Comparison Framework and Visualization Scheme for Ranking-Based Algorithms in High-Throughput Genome-Wide Studies 
Journal of Computational Biology  2009;16(4):565-577.
As a first step in analyzing high-throughput data in genome-wide studies, several algorithms are available to identify and prioritize candidates lists for downstream fine-mapping. The prioritized candidates could be differentially expressed genes, aberrations in comparative genomics hybridization studies, or single nucleotide polymorphisms (SNPs) in association studies. Different analysis algorithms are subject to various experimental artifacts and analytical features that lead to different candidate lists. However, little research has been carried out to theoretically quantify the consensus between different candidate lists and to compare the study specific accuracy of the analytical methods based on a known reference candidate list. Within the context of genome-wide studies, we propose a generic mathematical framework to statistically compare ranked lists of candidates from different algorithms with each other or, if available, with a reference candidate list. To cope with the growing need for intuitive visualization of high-throughput data in genome-wide studies, we describe a complementary customizable visualization tool. As a case study, we demonstrate application of our framework to the comparison and visualization of candidate lists generated in a DNA-pooling based genome-wide association study of CEPH data in the HapMap project, where prior knowledge from individual genotyping can be used to generate a true reference candidate list. The results provide a theoretical basis to compare the accuracy of various methods and to identify redundant methods, thus providing guidance for selecting the most suitable analysis method in genome-wide studies.
PMCID: PMC3148127  PMID: 19361328
genome-wide association studies; candidate lists
8.  Automatic Figure Ranking and User Interfacing for Intelligent Figure Search 
PLoS ONE  2010;5(10):e12983.
Figures are important experimental results that are typically reported in full-text bioscience articles. Bioscience researchers need to access figures to validate research facts and to formulate or to test novel research hypotheses. On the other hand, the sheer volume of bioscience literature has made it difficult to access figures. Therefore, we are developing an intelligent figure search engine ( Existing research in figure search treats each figure equally, but we introduce a novel concept of “figure ranking”: figures appearing in a full-text biomedical article can be ranked by their contribution to the knowledge discovery.
We empirically validated the hypothesis of figure ranking with over 100 bioscience researchers, and then developed unsupervised natural language processing (NLP) approaches to automatically rank figures. Evaluating on a collection of 202 full-text articles in which authors have ranked the figures based on importance, our best system achieved a weighted error rate of 0.2, which is significantly better than several other baseline systems we explored. We further explored a user interfacing application in which we built novel user interfaces (UIs) incorporating figure ranking, allowing bioscience researchers to efficiently access important figures. Our evaluation results show that 92% of the bioscience researchers prefer as the top two choices the user interfaces in which the most important figures are enlarged. With our automatic figure ranking NLP system, bioscience researchers preferred the UIs in which the most important figures were predicted by our NLP system than the UIs in which the most important figures were randomly assigned. In addition, our results show that there was no statistical difference in bioscience researchers' preference in the UIs generated by automatic figure ranking and UIs by human ranking annotation.
The evaluation results conclude that automatic figure ranking and user interfacing as we reported in this study can be fully implemented in online publishing. The novel user interface integrated with the automatic figure ranking system provides a more efficient and robust way to access scientific information in the biomedical domain, which will further enhance our existing figure search engine to better facilitate accessing figures of interest for bioscientists.
PMCID: PMC2951344  PMID: 20949102
9.  Gene selection with multiple ordering criteria 
BMC Bioinformatics  2007;8:74.
A microarray study may select different differentially expressed gene sets because of different selection criteria. For example, the fold-change and p-value are two commonly known criteria to select differentially expressed genes under two experimental conditions. These two selection criteria often result in incompatible selected gene sets. Also, in a two-factor, say, treatment by time experiment, the investigator may be interested in one gene list that responds to both treatment and time effects.
We propose three layer ranking algorithms, point-admissible, line-admissible (convex), and Pareto, to provide a preference gene list from multiple gene lists generated by different ranking criteria. Using the public colon data as an example, the layer ranking algorithms are applied to the three univariate ranking criteria, fold-change, p-value, and frequency of selections by the SVM-RFE classifier. A simulation experiment shows that for experiments with small or moderate sample sizes (less than 20 per group) and detecting a 4-fold change or less, the two-dimensional (p-value and fold-change) convex layer ranking selects differentially expressed genes with generally lower FDR and higher power than the standard p-value ranking. Three applications are presented. The first application illustrates a use of the layer rankings to potentially improve predictive accuracy. The second application illustrates an application to a two-factor experiment involving two dose levels and two time points. The layer rankings are applied to selecting differentially expressed genes relating to the dose and time effects. In the third application, the layer rankings are applied to a benchmark data set consisting of three dilution concentrations to provide a ranking system from a long list of differentially expressed genes generated from the three dilution concentrations.
The layer ranking algorithms are useful to help investigators in selecting the most promising genes from multiple gene lists generated by different filter, normalization, or analysis methods for various objectives.
PMCID: PMC1829166  PMID: 17338815
10.  Coordinated Regulation by AgrA, SarA, and SarR To Control agr Expression in Staphylococcus aureus▿† 
Journal of Bacteriology  2011;193(21):6020-6031.
The agr locus of Staphylococcus aureus is composed of two divergent transcripts (RNAII and RNAIII) driven by the P2 and P3 promoters. The P2-P3 intergenic region comprises the SarA/SarR binding sites and the four AgrA boxes to which AgrA binds. We reported here the role of AgrA, SarA, and SarR on agr P2 and P3 transcription. Using real-time reverse transcription (RT)-PCR and promoter fusion studies with selected single, double, triple, and complemented mutants, we showed that AgrA is indispensable to agr P2 and P3 transcription, whereas SarA activates and SarR represses P2 transcription. In vitro runoff transcription assays revealed that AgrA alone promoted transcription from the agr P2 promoter, with SarA enhancing it and SarR inhibiting agr P2 transcription in the presence of AgrA or with SarA and AgrA. Electrophoretic mobility shift assay (EMSA) analysis disclosed that SarR binds more avidly to the agr promoter than SarA and displaces SarA from the agr promoter. Additionally, SarA and AgrA bend the agr P2 promoter, whereas SarR does not. Collectively, these data indicated that AgrA activates agr P2 and P3 promoters while SarA activates the P2 promoter, presumably via bending of promoter DNA to bring together AgrA dimers to facilitate engagement of RNA polymerase (RNAP) to initiate transcription.
PMCID: PMC3194896  PMID: 21908676
11.  Ranked retrieval of segmented nuclei for objective assessment of cancer gene repositioning 
BMC Bioinformatics  2012;13:232.
Correct segmentation is critical to many applications within automated microscopy image analysis. Despite the availability of advanced segmentation algorithms, variations in cell morphology, sample preparation, and acquisition settings often lead to segmentation errors. This manuscript introduces a ranked-retrieval approach using logistic regression to automate selection of accurately segmented nuclei from a set of candidate segmentations. The methodology is validated on an application of spatial gene repositioning in breast cancer cell nuclei. Gene repositioning is analyzed in patient tissue sections by labeling sequences with fluorescence in situ hybridization (FISH), followed by measurement of the relative position of each gene from the nuclear center to the nuclear periphery. This technique requires hundreds of well-segmented nuclei per sample to achieve statistical significance. Although the tissue samples in this study contain a surplus of available nuclei, automatic identification of the well-segmented subset remains a challenging task.
Logistic regression was applied to features extracted from candidate segmented nuclei, including nuclear shape, texture, context, and gene copy number, in order to rank objects according to the likelihood of being an accurately segmented nucleus. The method was demonstrated on a tissue microarray dataset of 43 breast cancer patients, comprising approximately 40,000 imaged nuclei in which the HES5 and FRA2 genes were labeled with FISH probes. Three trained reviewers independently classified nuclei into three classes of segmentation accuracy. In man vs. machine studies, the automated method outperformed the inter-observer agreement between reviewers, as measured by area under the receiver operating characteristic (ROC) curve. Robustness of gene position measurements to boundary inaccuracies was demonstrated by comparing 1086 manually and automatically segmented nuclei. Pearson correlation coefficients between the gene position measurements were above 0.9 (p < 0.05). A preliminary experiment was conducted to validate the ranked retrieval in a test to detect cancer. Independent manual measurement of gene positions agreed with automatic results in 21 out of 26 statistical comparisons against a pooled normal (benign) gene position distribution.
Accurate segmentation is necessary to automate quantitative image analysis for applications such as gene repositioning. However, due to heterogeneity within images and across different applications, no segmentation algorithm provides a satisfactory solution. Automated assessment of segmentations by ranked retrieval is capable of reducing or even eliminating the need to select segmented objects by hand and represents a significant improvement over binary classification. The method can be extended to other high-throughput applications requiring accurate detection of cells or nuclei across a range of biomedical applications.
PMCID: PMC3484015  PMID: 22971117
12.  MINT and IntAct contribute to the Second BioCreative challenge: serving the text-mining community with high quality molecular interaction data 
Genome Biology  2008;9(Suppl 2):S5.
In the absence of consolidated pipelines to archive biological data electronically, information dispersed in the literature must be captured by manual annotation. Unfortunately, manual annotation is time consuming and the coverage of published interaction data is therefore far from complete. The use of text-mining tools to identify relevant publications and to assist in the initial information extraction could help to improve the efficiency of the curation process and, as a consequence, the database coverage of data available in the literature. The 2006 BioCreative competition was aimed at evaluating text-mining procedures in comparison with manual annotation of protein-protein interactions.
To aid the BioCreative protein-protein interaction task, IntAct and MINT (Molecular INTeraction) provided both the training and the test datasets. Data from both databases are comparable because they were curated according to the same standards. During the manual curation process, the major cause of data loss in mining the articles for information was ambiguity in the mapping of the gene names to stable UniProtKB database identifiers. It was also observed that most of the information about interactions was contained only within the full-text of the publication; hence, text mining of protein-protein interaction data will require the analysis of the full-text of the articles and cannot be restricted to the abstract.
The development of text-mining tools to extract protein-protein interaction information may increase the literature coverage achieved by manual curation. To support the text-mining community, databases will highlight those sentences within the articles that describe the interactions. These will supply data-miners with a high quality dataset for algorithm development. Furthermore, the dictionary of terms created by the BioCreative competitors could enrich the synonym list of the PSI-MI (Proteomics Standards Initiative-Molecular Interactions) controlled vocabulary, which is used by both databases to annotate their data content.
PMCID: PMC2559989  PMID: 18834496
13.  FACTA: a text search engine for finding associated biomedical concepts 
Bioinformatics  2008;24(21):2559-2560.
Summary: FACTA is a text search engine for MEDLINE abstracts, which is designed particularly to help users browse biomedical concepts (e.g. genes/proteins, diseases, enzymes and chemical compounds) appearing in the documents retrieved by the query. The concepts are presented to the user in a tabular format and ranked based on the co-occurrence statistics. Unlike existing systems that provide similar functionality, FACTA pre-indexes not only the words but also the concepts mentioned in the documents, which enables the user to issue a flexible query (e.g. free keywords or Boolean combinations of keywords/concepts) and receive the results immediately even when the number of the documents that match the query is very large. The user can also view snippets from MEDLINE to get textual evidence of associations between the query terms and the concepts. The concept IDs and their names/synonyms for building the indexes were collected from several biomedical databases and thesauri, such as UniProt, BioThesaurus, UMLS, KEGG and DrugBank.
Availability: The system is available at
PMCID: PMC2572701  PMID: 18772154
14.  BioCreative III interactive task: an overview 
BMC Bioinformatics  2011;12(Suppl 8):S4.
The BioCreative challenge evaluation is a community-wide effort for evaluating text mining and information extraction systems applied to the biological domain. The biocurator community, as an active user of biomedical literature, provides a diverse and engaged end user group for text mining tools. Earlier BioCreative challenges involved many text mining teams in developing basic capabilities relevant to biological curation, but they did not address the issues of system usage, insertion into the workflow and adoption by curators. Thus in BioCreative III (BC-III), the InterActive Task (IAT) was introduced to address the utility and usability of text mining tools for real-life biocuration tasks. To support the aims of the IAT in BC-III, involvement of both developers and end users was solicited, and the development of a user interface to address the tasks interactively was requested.
A User Advisory Group (UAG) actively participated in the IAT design and assessment. The task focused on gene normalization (identifying gene mentions in the article and linking these genes to standard database identifiers), gene ranking based on the overall importance of each gene mentioned in the article, and gene-oriented document retrieval (identifying full text papers relevant to a selected gene). Six systems participated and all processed and displayed the same set of articles. The articles were selected based on content known to be problematic for curation, such as ambiguity of gene names, coverage of multiple genes and species, or introduction of a new gene name. Members of the UAG curated three articles for training and assessment purposes, and each member was assigned a system to review. A questionnaire related to the interface usability and task performance (as measured by precision and recall) was answered after systems were used to curate articles. Although the limited number of articles analyzed and users involved in the IAT experiment precluded rigorous quantitative analysis of the results, a qualitative analysis provided valuable insight into some of the problems encountered by users when using the systems. The overall assessment indicates that the system usability features appealed to most users, but the system performance was suboptimal (mainly due to low accuracy in gene normalization). Some of the issues included failure of species identification and gene name ambiguity in the gene normalization task leading to an extensive list of gene identifiers to review, which, in some cases, did not contain the relevant genes. The document retrieval suffered from the same shortfalls. The UAG favored achieving high performance (measured by precision and recall), but strongly recommended the addition of features that facilitate the identification of correct gene and its identifier, such as contextual information to assist in disambiguation.
The IAT was an informative exercise that advanced the dialog between curators and developers and increased the appreciation of challenges faced by each group. A major conclusion was that the intended users should be actively involved in every phase of software development, and this will be strongly encouraged in future tasks. The IAT Task provides the first steps toward the definition of metrics and functional requirements that are necessary for designing a formal evaluation of interactive curation systems in the BioCreative IV challenge.
PMCID: PMC3269939  PMID: 22151968
15.  Recommending MeSH terms for annotating biomedical articles 
Due to the high cost of manual curation of key aspects from the scientific literature, automated methods for assisting this process are greatly desired. Here, we report a novel approach to facilitate MeSH indexing, a challenging task of assigning MeSH terms to MEDLINE citations for their archiving and retrieval.
Unlike previous methods for automatic MeSH term assignment, we reformulate the indexing task as a ranking problem such that relevant MeSH headings are ranked higher than those irrelevant ones. Specifically, for each document we retrieve 20 neighbor documents, obtain a list of MeSH main headings from neighbors, and rank the MeSH main headings using ListNet–a learning-to-rank algorithm. We trained our algorithm on 200 documents and tested on a previously used benchmark set of 200 documents and a larger dataset of 1000 documents.
Tested on the benchmark dataset, our method achieved a precision of 0.390, recall of 0.712, and mean average precision (MAP) of 0.626. In comparison to the state of the art, we observe statistically significant improvements as large as 39% in MAP (p-value <0.001). Similar significant improvements were also obtained on the larger document set.
Experimental results show that our approach makes the most accurate MeSH predictions to date, which suggests its great potential in making a practical impact on MeSH indexing. Furthermore, as discussed the proposed learning framework is robust and can be adapted to many other similar tasks beyond MeSH indexing in the biomedical domain. All data sets are available at:
PMCID: PMC3168302  PMID: 21613640
Information storage and retrieval; translational research—application of biological knowledge to clinical care; discovery; text and data mining methods; natural-language processing; Automated learning; Delivering health information and knowledge to the public; algorithms; text mining; data mining; machine learning; mesh; MEDLINE indexing; learning to rank algorithm
16.  RankAggreg, an R package for weighted rank aggregation 
BMC Bioinformatics  2009;10:62.
Researchers in the field of bioinformatics often face a challenge of combining several ordered lists in a proper and efficient manner. Rank aggregation techniques offer a general and flexible framework that allows one to objectively perform the necessary aggregation. With the rapid growth of high-throughput genomic and proteomic studies, the potential utility of rank aggregation in the context of meta-analysis becomes even more apparent. One of the major strengths of rank-based aggregation is the ability to combine lists coming from different sources and platforms, for example different microarray chips, which may or may not be directly comparable otherwise.
The RankAggreg package provides two methods for combining the ordered lists: the Cross-Entropy method and the Genetic Algorithm. Two examples of rank aggregation using the package are given in the manuscript: one in the context of clustering based on gene expression, and the other one in the context of meta-analysis of prostate cancer microarray experiments.
The two examples described in the manuscript clearly show the utility of the RankAggreg package in the current bioinformatics context where ordered lists are routinely produced as a result of modern high-throughput technologies.
PMCID: PMC2669484  PMID: 19228411
17.  Finding biomarkers in non-model species: literature mining of transcription factors involved in bovine embryo development 
BioData Mining  2012;5:12.
Since processes in well-known model organisms have specific features different from those in Bos taurus, the organism under study, a good way to describe gene regulation in ruminant embryos would be a species-specific consideration of closely related species to cattle, sheep and pig. However, as highlighted by a recent report, gene dictionaries in pig are smaller than in cattle, bringing a risk to reduce the gene resources to be mined (and so for sheep dictionaries). Bioinformatics approaches that allow an integration of available information on gene function in model organisms, taking into account their specificity, are thus needed. Besides these closely related and biologically relevant species, there is indeed much more knowledge of (i) trophoblast proliferation and differentiation or (ii) embryogenesis in human and mouse species, which provides opportunities for reconstructing proliferation and/or differentiation processes in other mammalian embryos, including ruminants. The necessary knowledge can be obtained partly from (i) stem cell or cancer research to supply useful information on molecular agents or molecular interactions at work in cell proliferation and (ii) mouse embryogenesis to supply useful information on embryo differentiation. However, the total number of publications for all these topics and species is great and their manual processing would be tedious and time consuming. This is why we used text mining for automated text analysis and automated knowledge extraction. To evaluate the quality of this “mining”, we took advantage of studies that reported gene expression profiles during the elongation of bovine embryos and defined a list of transcription factors (or TF, n = 64) that we used as biological “gold standard”. When successful, the “mining” approach would identify them all, as well as novel ones.
To gain knowledge on molecular-genetic regulations in a non model organism, we offer an approach based on literature-mining and score arrangement of data from model organisms. This approach was applied to identify novel transcription factors during bovine blastocyst elongation, a process that is not observed in rodents and primates. As a result, searching through human and mouse corpuses, we identified numerous bovine homologs, among which 11 to 14% of transcription factors including the gold standard TF as well as novel TF potentially important to gene regulation in ruminant embryo development. The scripts of the workflow are written in Perl and available on demand. They require data input coming from all various databases for any kind of biological issue once the data has been prepared according to keywords for the studied topic and species; we can provide data sample to illustrate the use and functionality of the workflow.
To do so, we created a workflow that allowed the pipeline processing of literature data and biological data, extracted from Web of Science (WoS) or PubMed but also from Gene Expression Omnibus (GEO), Gene Ontology (GO), Uniprot, HomoloGene, TcoF-DB and TFe (TF encyclopedia). First, the human and mouse homologs of the bovine proteins were selected, filtered by text corpora and arranged by score functions. The score functions were based on the gene name frequencies in corpora. Then, transcription factors were identified using TcoF-DB and double-checked using TFe to characterise TF groups and families. Thus, among a search space of 18,670 bovine homologs, 489 were identified as transcription factors. Among them, 243 were absent from the high-throughput data available at the time of the study. They thus stand so far for putative TF acting during bovine embryo elongation, but might be retrieved from a recent RNA sequencing dataset (Mamo et al. , 2012). Beyond the 246 TF that appeared expressed in bovine elongating tissues, we restricted our interpretation to those occurring within a list of 50 top-ranked genes. Among the transcription factors identified therein, half belonged to the gold standard (ASCL2, c-FOS, ETS2, GATA3, HAND1) and half did not (ESR1, HES1, ID2, NANOG, PHB2, TP53, STAT3).
A workflow providing search for transcription factors acting in bovine elongation was developed. The model assumed that proteins sharing the same protein domains in closely related species had the same protein functionalities, even if they were differently regulated among species or involved in somewhat different pathways. Under this assumption, we merged the information on different mammalian species from different databases (literature and biology) and proposed 489 TF as potential participants of embryo proliferation and differentiation, with (i) a recall of 95% with regard to a biological gold standard defined in 2011 and (ii) an extension of more than 3 times the gold standard of TF detected so far in elongating tissues. The working capacity of the workflow was supported by the manual expertise of the biologists on the results. The workflow can serve as a new kind of bioinformatics tool to work on fused data sources and can thus be useful in studies of a wide range of biological processes.
PMCID: PMC3563503  PMID: 22931563
18.  Learning to Rank Figures within a Biomedical Article 
PLoS ONE  2014;9(3):e61567.
Hundreds of millions of figures are available in biomedical literature, representing important biomedical experimental evidence. This ever-increasing sheer volume has made it difficult for scientists to effectively and accurately access figures of their interest, the process of which is crucial for validating research facts and for formulating or testing novel research hypotheses. Current figure search applications can't fully meet this challenge as the “bag of figures” assumption doesn't take into account the relationship among figures. In our previous study, hundreds of biomedical researchers have annotated articles in which they serve as corresponding authors. They ranked each figure in their paper based on a figure's importance at their discretion, referred to as “figure ranking”. Using this collection of annotated data, we investigated computational approaches to automatically rank figures. We exploited and extended the state-of-the-art listwise learning-to-rank algorithms and developed a new supervised-learning model BioFigRank.
The cross-validation results show that BioFigRank yielded the best performance compared with other state-of-the-art computational models, and the greedy feature selection can further boost the ranking performance significantly. Furthermore, we carry out the evaluation by comparing BioFigRank with three-level competitive domain-specific human experts: (1) First Author, (2) Non-Author-In-Domain-Expert who is not the author nor co-author of an article but who works in the same field of the corresponding author of the article, and (3) Non-Author-Out-Domain-Expert who is not the author nor co-author of an article and who may or may not work in the same field of the corresponding author of an article. Our results show that BioFigRank outperforms Non-Author-Out-Domain-Expert and performs as well as Non-Author-In-Domain-Expert. Although BioFigRank underperforms First Author, since most biomedical researchers are either in- or out-domain-experts for an article, we conclude that BioFigRank represents an artificial intelligence system that offers expert-level intelligence to help biomedical researchers to navigate increasingly proliferated big data efficiently.
PMCID: PMC3953065  PMID: 24625719
19.  GLAD4U: deriving and prioritizing gene lists from PubMed literature 
BMC Genomics  2012;13(Suppl 8):S20.
Answering questions such as "Which genes are related to breast cancer?" usually requires retrieving relevant publications through the PubMed search engine, reading these publications, and creating gene lists. This process is not only time-consuming, but also prone to errors.
We report GLAD4U (Gene List Automatically Derived For You), a new, free web-based gene retrieval and prioritization tool. GLAD4U takes advantage of existing resources of the NCBI to ensure computational efficiency. The quality of gene lists created by GLAD4U for three Gene Ontology (GO) terms and three disease terms was assessed using corresponding "gold standard" lists curated in public databases. For all queries, GLAD4U gene lists showed very high recall but low precision, leading to low F-measure. As a comparison, EBIMed's recall was consistently lower than GLAD4U, but its precision was higher. To present the most relevant genes at the top of a list, we studied two prioritization methods based on publication count and the hypergeometric test, and compared the ranked lists and those generated by EBIMed to the gold standards. Both GLAD4U methods outperformed EBIMed for all queries based on a variety of quality metrics. Moreover, the hypergeometric method allowed for a better performance by thresholding genes with low scores. In addition, manual examination suggests that many false-positives could be explained by the incompleteness of the gold standards. The GLAD4U user interface accepts any valid queries for PubMed, and its output page displays the ranked gene list and information associated with each gene, chronologically-ordered supporting publications, along with a summary of the run and links for file export and functional enrichment and protein interaction network analysis.
GLAD4U has a high overall recall. Although precision is generally low, the prioritization methods successfully rank truly relevant genes at the top of the lists to facilitate efficient browsing. GLAD4U is simple to use, and its interface can be found at:
PMCID: PMC3535723  PMID: 23282288
20.  Automated Acquisition of Disease–Drug Knowledge from Biomedical and Clinical Documents: An Initial Study 
Explore the automated acquisition of knowledge in biomedical and clinical documents using text mining and statistical techniques to identify disease-drug associations.
Biomedical literature and clinical narratives from the patient record were mined to gather knowledge about disease-drug associations. Two NLP systems, BioMedLEE and MedLEE, were applied to Medline articles and discharge summaries, respectively. Disease and drug entities were identified using the NLP systems in addition to MeSH annotations for the Medline articles. Focusing on eight diseases, co-occurrence statistics were applied to compute and evaluate the strength of association between each disease and relevant drugs.
Ranked lists of disease-drug pairs were generated and cutoffs calculated for identifying stronger associations among these pairs for further analysis. Differences and similarities between the text sources (i.e., biomedical literature and patient record) and annotations (i.e., MeSH and NLP-extracted UMLS concepts) with regards to disease-drug knowledge were observed.
This paper presents a method for acquiring disease-specific knowledge and a feasibility study of the method. The method is based on applying a combination of NLP and statistical techniques to both biomedical and clinical documents. The approach enabled extraction of knowledge about the drugs clinicians are using for patients with specific diseases based on the patient record, while it is also acquired knowledge of drugs frequently involved in controlled trials for those same diseases. In comparing the disease-drug associations, we found the results to be appropriate: the two text sources contained consistent as well as complementary knowledge, and manual review of the top five disease-drug associations by a medical expert supported their correctness across the diseases.
PMCID: PMC2274872  PMID: 17947625
21.  Cloning and Gene Replacement Mutagenesis of a Pseudomonas atlantica Agarase Gene 
An agarase gene (agrA) was isolated by cloning genomic DNA prepared from Pseudomonas atlantica. The agarase activity in recombinant Escherichia coli was found in cell-free culture supernatants and could pass through a 0.45-μm-pore-size membrane separating cells from agar, suggesting that the gene product was exported in E. coli. The enzyme was specific for agar and agarose and did not digest alginate or carrageenan. Mutations generated by transposon mini-Mu d1(lacZ Kmr) were used to define the agrA coding region, as well as the direction of transcription of the gene. A procedure was developed to produce a P. atlantica agrA mutant. This required construction of an agrA::kan insertion mutation in vitro and subsequent introduction of the defect into the chromosome of P. atlantica by recombinational exchange. Transformation of P. atlantica with plasmids containing agrA::kan utilized a Tris-polyethylene glycol 6000-CaCl2 treatment for making competent cells. Replacement of wild-type agrA with agrA::kan resulted in loss of agarase activity. Uses of the agrA gene probe and an Agr− mutant for environmental studies are discussed.
PMCID: PMC202392  PMID: 16347536
22.  NGS-Trex: Next Generation Sequencing Transcriptome profile explorer 
BMC Bioinformatics  2013;14(Suppl 7):S10.
Next-Generation Sequencing (NGS) technology has exceptionally increased the ability to sequence DNA in a massively parallel and cost-effective manner. Nevertheless, NGS data analysis requires bioinformatics skills and computational resources well beyond the possibilities of many "wet biology" laboratories. Moreover, most of projects only require few sequencing cycles and standard tools or workflows to carry out suitable analyses for the identification and annotation of genes, transcripts and splice variants found in the biological samples under investigation. These projects can take benefits from the availability of easy to use systems to automatically analyse sequences and to mine data without the preventive need of strong bioinformatics background and hardware infrastructure.
To address this issue we developed an automatic system targeted to the analysis of NGS data obtained from large-scale transcriptome studies. This system, we named NGS-Trex (NGS Transcriptome profile explorer) is available through a simple web interface and allows the user to upload raw sequences and easily obtain an accurate characterization of the transcriptome profile after the setting of few parameters required to tune the analysis procedure. The system is also able to assess differential expression at both gene and transcript level (i.e. splicing isoforms) by comparing the expression profile of different samples.
By using simple query forms the user can obtain list of genes, transcripts, splice sites ranked and filtered according to several criteria. Data can be viewed as tables, text files or through a simple genome browser which helps the visual inspection of the data.
NGS-Trex is a simple tool for RNA-Seq data analysis mainly targeted to "wet biology" researchers with limited bioinformatics skills. It offers simple data mining tools to explore transcriptome profiles of samples investigated taking advantage of NGS technologies.
PMCID: PMC3633008  PMID: 23815181
23.  Evaluation of BioCreAtIvE assessment of task 2 
BMC Bioinformatics  2005;6(Suppl 1):S16.
Molecular Biology accumulated substantial amounts of data concerning functions of genes and proteins. Information relating to functional descriptions is generally extracted manually from textual data and stored in biological databases to build up annotations for large collections of gene products. Those annotation databases are crucial for the interpretation of large scale analysis approaches using bioinformatics or experimental techniques. Due to the growing accumulation of functional descriptions in biomedical literature the need for text mining tools to facilitate the extraction of such annotations is urgent. In order to make text mining tools useable in real world scenarios, for instance to assist database curators during annotation of protein function, comparisons and evaluations of different approaches on full text articles are needed.
The Critical Assessment for Information Extraction in Biology (BioCreAtIvE) contest consists of a community wide competition aiming to evaluate different strategies for text mining tools, as applied to biomedical literature. We report on task two which addressed the automatic extraction and assignment of Gene Ontology (GO) annotations of human proteins, using full text articles. The predictions of task 2 are based on triplets of protein – GO term – article passage. The annotation-relevant text passages were returned by the participants and evaluated by expert curators of the GO annotation (GOA) team at the European Institute of Bioinformatics (EBI). Each participant could submit up to three results for each sub-task comprising task 2. In total more than 15,000 individual results were provided by the participants. The curators evaluated in addition to the annotation itself, whether the protein and the GO term were correctly predicted and traceable through the submitted text fragment.
Concepts provided by GO are currently the most extended set of terms used for annotating gene products, thus they were explored to assess how effectively text mining tools are able to extract those annotations automatically. Although the obtained results are promising, they are still far from reaching the required performance demanded by real world applications. Among the principal difficulties encountered to address the proposed task, were the complex nature of the GO terms and protein names (the large range of variants which are used to express proteins and especially GO terms in free text), and the lack of a standard training set. A range of very different strategies were used to tackle this task. The dataset generated in line with the BioCreative challenge is publicly available and will allow new possibilities for training information extraction methods in the domain of molecular biology.
PMCID: PMC1869008  PMID: 15960828
24.  A genome-wide MeSH-based literature mining system predicts implicit gene-to-gene relationships and networks 
BMC Systems Biology  2013;7(Suppl 3):S9.
The large amount of literature in the post-genomics era enables the study of gene interactions and networks using all available articles published for a specific organism. MeSH is a controlled vocabulary of medical and scientific terms that is used by biomedical scientists to manually index articles in the PubMed literature database. We hypothesized that genome-wide gene-MeSH term associations from the PubMed literature database could be used to predict implicit gene-to-gene relationships and networks. While the gene-MeSH associations have been used to detect gene-gene interactions in some studies, different methods have not been well compared, and such a strategy has not been evaluated for a genome-wide literature analysis. Genome-wide literature mining of gene-to-gene interactions allows ranking of the best gene interactions and investigation of comprehensive biological networks at a genome level.
The genome-wide GenoMesh literature mining algorithm was developed by sequentially generating a gene-article matrix, a normalized gene-MeSH term matrix, and a gene-gene matrix. The gene-gene matrix relies on the calculation of pairwise gene dissimilarities based on gene-MeSH relationships. An optimized dissimilarity score was identified from six well-studied functions based on a receiver operating characteristic (ROC) analysis. Based on the studies with well-studied Escherichia coli and less-studied Brucella spp., GenoMesh was found to accurately identify gene functions using weighted MeSH terms, predict gene-gene interactions not reported in the literature, and cluster all the genes studied from an organism using the MeSH-based gene-gene matrix. A web-based GenoMesh literature mining program is also available at: GenoMesh also predicts gene interactions and networks among genes associated with specific MeSH terms or user-selected gene lists.
The GenoMesh algorithm and web program provide the first genome-wide, MeSH-based literature mining system that effectively predicts implicit gene-gene interaction relationships and networks in a genome-wide scope.
PMCID: PMC3852244  PMID: 24555475
25.  Robust assignment of cancer subtypes from expression data using a uni-variate gene expression average as classifier 
BMC Cancer  2010;10:532.
Genome wide gene expression data is a rich source for the identification of gene signatures suitable for clinical purposes and a number of statistical algorithms have been described for both identification and evaluation of such signatures. Some employed algorithms are fairly complex and hence sensitive to over-fitting whereas others are more simple and straight forward. Here we present a new type of simple algorithm based on ROC analysis and the use of metagenes that we believe will be a good complement to existing algorithms.
The basis for the proposed approach is the use of metagenes, instead of collections of individual genes, and a feature selection using AUC values obtained by ROC analysis. Each gene in a data set is assigned an AUC value relative to the tumor class under investigation and the genes are ranked according to these values. Metagenes are then formed by calculating the mean expression level for an increasing number of ranked genes, and the metagene expression value that optimally discriminates tumor classes in the training set is used for classification of new samples. The performance of the metagene is then evaluated using LOOCV and balanced accuracies.
We show that the simple uni-variate gene expression average algorithm performs as well as several alternative algorithms such as discriminant analysis and the more complex approaches such as SVM and neural networks. The R package rocc is freely available at
PMCID: PMC2966465  PMID: 20925936

Results 1-25 (777425)