Researchers in the field of bioinformatics often face a challenge of combining several ordered lists in a proper and efficient manner. Rank aggregation techniques offer a general and flexible framework that allows one to objectively perform the necessary aggregation. With the rapid growth of high-throughput genomic and proteomic studies, the potential utility of rank aggregation in the context of meta-analysis becomes even more apparent. One of the major strengths of rank-based aggregation is the ability to combine lists coming from different sources and platforms, for example different microarray chips, which may or may not be directly comparable otherwise.
The RankAggreg package provides two methods for combining the ordered lists: the Cross-Entropy method and the Genetic Algorithm. Two examples of rank aggregation using the package are given in the manuscript: one in the context of clustering based on gene expression, and the other one in the context of meta-analysis of prostate cancer microarray experiments.
The two examples described in the manuscript clearly show the utility of the RankAggreg package in the current bioinformatics context where ordered lists are routinely produced as a result of modern high-throughput technologies.
As a first step in analyzing high-throughput data in genome-wide studies, several algorithms are available to identify and prioritize candidate lists for downstream fine-mapping. The prioritized candidates could be differentially expressed genes, aberrations in comparative genomic hybridization studies, or single nucleotide polymorphisms (SNPs) in association studies. Different analysis algorithms are subject to various experimental artifacts and analytical features that lead to different candidate lists. However, little research has been carried out to theoretically quantify the consensus between different candidate lists and to compare the study-specific accuracy of the analytical methods based on a known reference candidate list. Within the context of genome-wide studies, we propose a generic mathematical framework to statistically compare ranked lists of candidates from different algorithms with each other or, if available, with a reference candidate list. To cope with the growing need for intuitive visualization of high-throughput data in genome-wide studies, we describe a complementary customizable visualization tool. As a case study, we demonstrate application of our framework to the comparison and visualization of candidate lists generated in a DNA-pooling based genome-wide association study of CEPH data in the HapMap project, where prior knowledge from individual genotyping can be used to generate a true reference candidate list. The results provide a theoretical basis to compare the accuracy of various methods and to identify redundant methods, thus providing guidance for selecting the most suitable analysis method in genome-wide studies.
genome-wide association studies; candidate lists
Ranked gene lists from microarray experiments are usually analysed by assigning significance to predefined gene categories, e.g., based on functional annotations. Tools performing such analyses are often restricted to a category score based on a cutoff in the ranked list and a significance calculation based on random gene permutations as null hypothesis.
We analysed three publicly available data sets, in each of which samples were divided in two classes and genes ranked according to their correlation to class labels. We developed a program, Catmap (available for download at ), to compare different scores and null hypotheses in gene category analysis, using Gene Ontology annotations for category definition. When a cutoff-based score was used, results depended strongly on the choice of cutoff, introducing an arbitrariness in the analysis. Comparing results using random gene permutations and random sample permutations, respectively, we found that the assigned significance of a category depended strongly on the choice of null hypothesis. Compared to sample label permutations, gene permutations gave much smaller p-values for large categories with many coexpressed genes.
In gene category analyses of ranked gene lists, a cutoff-independent score is preferable. The choice of null hypothesis is very important; random gene permutations do not work well as an approximation to sample label permutations.
The outcome of a functional genomics pipeline is usually a partial list of genomic features, ranked by their relevance in modelling biological phenotype in terms of a classification or regression model. Due to resampling protocols or to a meta-analysis comparison, it is often the case that sets of alternative feature lists (possibly of different lengths) are obtained, instead of just one list. Here we introduce a method, based on permutations, for studying the variability between lists (“list stability”) in the case of lists of unequal length. We provide algorithms evaluating stability for lists embedded in the full feature set or just limited to the features occurring in the partial lists. The method is demonstrated by finding and comparing gene profiles on a large prostate cancer dataset, consisting of two cohorts of patients from different countries, for a total of 455 samples.
A major problem faced in biomedical informatics involves how best to present information retrieval results. When a single query retrieves many results, simply showing them as a long list often provides poor overview. With a goal of presenting users with reduced sets of relevant citations, this study developed an approach that retrieved and organized MEDLINE citations into different topical groups and prioritized important citations in each group.
A text mining system framework for automatic document clustering and ranking organized MEDLINE citations following simple PubMed queries. The system grouped the retrieved citations, ranked the citations in each cluster, and generated a set of keywords and MeSH terms to describe the common theme of each cluster.
Several possible ranking functions were compared, including citation count per year (CCPY), citation count (CC), and journal impact factor (JIF). We evaluated this framework by identifying as “important” those articles selected by the Surgical Oncology Society.
Our results showed that CCPY outperforms CC and JIF, i.e., CCPY better ranked important articles than did the others. Furthermore, our text clustering and knowledge extraction strategy grouped the retrieval results into informative clusters as revealed by the keywords and MeSH terms extracted from the documents in each cluster.
The text mining system studied effectively integrated text clustering, text summarization, and text ranking and organized MEDLINE retrieval results into different topical groups.
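The difference between ranking by citation count (CC) and by citation count per year (CCPY) can be sketched as follows. The exact CCPY definition used in the study is not given in the text, so this sketch assumes CCPY is the citation count divided by the number of years since publication, with a floor of one year. The article identifiers and numbers are invented for illustration only.

```python
def ccpy(citation_count, pub_year, current_year):
    """Citation count per year; assumed definition, with a one-year minimum age."""
    years = max(current_year - pub_year, 1)
    return citation_count / years

# Hypothetical articles within one cluster: id -> (citation count, publication year).
articles = {"a": (120, 2000), "b": (60, 2008), "c": (10, 2009)}
current_year = 2010

by_cc = sorted(articles, key=lambda k: articles[k][0], reverse=True)
by_ccpy = sorted(
    articles,
    key=lambda k: ccpy(articles[k][0], articles[k][1], current_year),
    reverse=True,
)
```

In this toy case the recent article "b" overtakes the older, more cited article "a" once age is taken into account, which is the kind of reordering that distinguishes CCPY from CC.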
The first objective of a DNA microarray experiment is typically to generate a list of genes or probes that are found to be differentially expressed or represented (in the case of comparative genomic hybridizations and/or copy number variation) between two conditions or strains. Rank Products analysis comprises a robust algorithm for deriving such lists from microarray experiments that comprise small numbers of replicates, for example, less than the number required for the commonly used t-test. Currently, users wishing to apply Rank Products analysis to their own microarray data sets have been restricted to the use of command line-based software which can limit its usage within the biological community.
Here we have developed a web interface to existing Rank Products analysis tools allowing users to quickly process their data in an intuitive and step-wise manner to obtain the respective Rank Product or Rank Sum, probability of false prediction and p-values in a downloadable file.
The online interactive Rank Products analysis tool RankProdIt, for analysis of any data set containing measurements for multiple replicated conditions, is available at: http://strep-microarray.sbs.surrey.ac.uk/RankProducts
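For readers unfamiliar with the statistic, the core of a rank product calculation can be sketched as below: each gene is ranked within every replicate, and the ranks are combined as a geometric mean. This is only an illustrative sketch and not the code behind RankProdIt; the function name `rank_products`, the input layout, and the example values are assumptions, ties are not handled specially, and the permutation-based probability of false prediction and p-values are omitted.

```python
import math

def rank_products(fold_changes):
    """fold_changes: dict gene -> list of per-replicate log fold changes (assumed layout)."""
    genes = list(fold_changes)
    n_rep = len(next(iter(fold_changes.values())))
    log_sum = {g: 0.0 for g in genes}
    for k in range(n_rep):
        # Rank 1 goes to the most up-regulated gene in replicate k.
        ordered = sorted(genes, key=lambda g: fold_changes[g][k], reverse=True)
        for rank, g in enumerate(ordered, start=1):
            log_sum[g] += math.log(rank)
    # Geometric mean of the ranks, computed in log space for numerical stability.
    return {g: math.exp(log_sum[g] / n_rep) for g in genes}

# Invented data: four genes, three replicates.
data = {
    "g1": [2.1, 1.8, 2.5],
    "g2": [0.3, 0.9, 0.1],
    "g3": [1.0, 0.2, 0.4],
    "g4": [-1.2, -0.8, -1.5],
}
rp = rank_products(data)
top = sorted(rp, key=rp.get)  # smallest rank product first
```

A gene that is consistently near the top in every replicate, such as `g1` here, receives a rank product close to 1.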
To explore the automated acquisition of knowledge in biomedical and clinical documents using text mining and statistical techniques to identify disease-drug associations.
Biomedical literature and clinical narratives from the patient record were mined to gather knowledge about disease-drug associations. Two NLP systems, BioMedLEE and MedLEE, were applied to Medline articles and discharge summaries, respectively. Disease and drug entities were identified using the NLP systems in addition to MeSH annotations for the Medline articles. Focusing on eight diseases, co-occurrence statistics were applied to compute and evaluate the strength of association between each disease and relevant drugs.
Ranked lists of disease-drug pairs were generated and cutoffs calculated for identifying stronger associations among these pairs for further analysis. Differences and similarities between the text sources (i.e., biomedical literature and patient record) and annotations (i.e., MeSH and NLP-extracted UMLS concepts) with regards to disease-drug knowledge were observed.
This paper presents a method for acquiring disease-specific knowledge and a feasibility study of the method. The method is based on applying a combination of NLP and statistical techniques to both biomedical and clinical documents. The approach enabled extraction of knowledge about the drugs clinicians are using for patients with specific diseases based on the patient record, while also acquiring knowledge of drugs frequently involved in controlled trials for those same diseases. In comparing the disease-drug associations, we found the results to be appropriate: the two text sources contained consistent as well as complementary knowledge, and manual review of the top five disease-drug associations by a medical expert supported their correctness across the diseases.
Virulence gene expression in the human pathogen Staphylococcus aureus is partially controlled by the two-component quorum-sensing system encoded by genes of the agr locus. The product of the agrA gene has been shown by amino acid sequence similarity to be the putative response regulator; however, binding of AgrA to promoters under its control has not yet been demonstrated. In this study, we isolated and purified soluble AgrA by expression under osmotic shock conditions and ion-exchange chromatography. Purified AgrA showed high-affinity binding to the RNAIII-agr intergenic region by electrophoretic mobility shift assays. Binding was localized by DNase I protection assays to a pair of direct repeats in the P2 and P3 promoter regions of the agr locus. We found that this binding was enhanced by the addition of the small phosphoryl donor, acetyl phosphate. The difference in binding affinity between these two promoters was found to result from a 2-bp difference between the downstream direct repeats of the P2 and P3 sites. Mutation of these base pairs in the P3 site to match those found in the P2 site increased the affinity of AgrA for the P3 site relative to that for the P2 site. These results are consistent with the function of AgrA as a response regulator with recognition sites in the promoter regions of RNAIII and the agr locus.
Using gene co-expression analysis, researchers were able to predict clusters of genes with consistent functions that are relevant to cancer development and prognosis. We applied a weighted gene co-expression network (WGCN) analysis algorithm on glioblastoma multiforme (GBM) data obtained from the TCGA project and predicted a set of gene co-expression networks which are related to GBM prognosis.
We modified the Quasi-Clique Merger algorithm (QCM algorithm) into edge-covering Quasi-Clique Merger algorithm (eQCM) for mining weighted sub-network in WGCN. Each sub-network is considered a set of features to separate patients into two groups using K-means algorithm. Survival times of the two groups are compared using log-rank test and Kaplan-Meier curves. Simulations using random sets of genes are carried out to determine the thresholds for log-rank test p-values for network selection. Sub-networks with p-values less than their corresponding thresholds were further merged into clusters based on overlap ratios (>50%). The functions for each cluster are analyzed using gene ontology enrichment analysis.
Using the eQCM algorithm, we identified 8,124 sub-networks in the WGCN, out of which 170 sub-networks show p-values less than their corresponding thresholds. They were then merged into 16 clusters.
We identified 16 gene clusters associated with GBM prognosis using the eQCM algorithm. Our results not only confirmed previous findings including the importance of cell cycle and immune response in GBM, but also suggested important epigenetic events in GBM development and prognosis.
Listeria monocytogenes is a gram-positive facultative intracellular food-borne pathogen that can cause severe infections in humans and animals. We have recently adapted signature-tagged transposon mutagenesis (STM) to identify genes involved in the virulence of L. monocytogenes. A new round of STM allowed us to identify a new locus encoding a protein homologous to AgrA, the well-studied response regulator of Staphylococcus aureus and part of a two-component system involved in bacterial virulence. The production of several secreted proteins was modified in the agrA mutant of L. monocytogenes grown in broth, indicating that the agr locus influenced protein secretion. Inactivation of agrA did not affect the ability of the pathogen to invade and multiply in cells in vitro. However, the virulence of the agrA mutant was attenuated in the mouse (a 10-fold increase in the 50% lethal dose by the intravenous route), demonstrating for the first time a role for the agr locus in the virulence of L. monocytogenes.
Identification of differentially expressed genes from microarray datasets is one of the most important analyses for microarray data mining. Popular algorithms such as the t-test rank genes based on a single statistic. The false positive rate of these methods can be improved by considering other features of differentially expressed genes.
We proposed a pattern recognition strategy for identifying differentially expressed genes. Genes are mapped to a two-dimensional feature space composed of average difference of gene expression and average expression levels. A density based pruning algorithm (DB Pruning) is developed to screen out potential differentially expressed genes usually located in the sparse boundary region. Biases of popular algorithms for identifying differentially expressed genes are visually characterized. Experiments on 17 datasets from the Gene Expression Omnibus (GEO) database with experimentally verified differentially expressed genes showed that DB pruning can significantly improve the prediction accuracy of popular identification algorithms such as t-test, rank product, and fold change.
Density based pruning of non-differentially expressed genes is an effective method for enhancing statistical testing based algorithms for identifying differentially expressed genes. It improves t-test, rank product, and fold change by 11% to 50% in the numbers of identified true differentially expressed genes. The source code of DB pruning is freely available on our website http://mleg.cse.sc.edu/degprune
A microarray study may select different differentially expressed gene sets because of different selection criteria. For example, the fold-change and p-value are two commonly known criteria to select differentially expressed genes under two experimental conditions. These two selection criteria often result in incompatible selected gene sets. Also, in a two-factor, say, treatment by time experiment, the investigator may be interested in one gene list that responds to both treatment and time effects.
We propose three layer ranking algorithms, point-admissible, line-admissible (convex), and Pareto, to provide a preference gene list from multiple gene lists generated by different ranking criteria. Using the public colon data as an example, the layer ranking algorithms are applied to the three univariate ranking criteria, fold-change, p-value, and frequency of selections by the SVM-RFE classifier. A simulation experiment shows that for experiments with small or moderate sample sizes (less than 20 per group) and detecting a 4-fold change or less, the two-dimensional (p-value and fold-change) convex layer ranking selects differentially expressed genes with generally lower FDR and higher power than the standard p-value ranking. Three applications are presented. The first application illustrates a use of the layer rankings to potentially improve predictive accuracy. The second application illustrates an application to a two-factor experiment involving two dose levels and two time points. The layer rankings are applied to selecting differentially expressed genes relating to the dose and time effects. In the third application, the layer rankings are applied to a benchmark data set consisting of three dilution concentrations to provide a ranking system from a long list of differentially expressed genes generated from the three dilution concentrations.
The layer ranking algorithms are useful to help investigators in selecting the most promising genes from multiple gene lists generated by different filter, normalization, or analysis methods for various objectives.
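The Pareto variant of layer ranking can be illustrated with a short sketch: the first layer holds every gene that no other gene beats on all criteria at once, that layer is removed, and the procedure repeats on the remainder. This is a generic non-dominated sorting sketch and not the authors' implementation; the function name `pareto_layers`, the convention that larger values are better on every axis, and the example scores are assumptions.

```python
def pareto_layers(scores):
    """scores: dict gene -> tuple of criteria, larger is better on every axis (assumed)."""
    def dominates(a, b):
        # a dominates b if it is at least as good everywhere and strictly better somewhere.
        return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

    remaining = dict(scores)
    layers = []
    while remaining:
        # Current layer: genes not dominated by any gene still in play.
        front = sorted(
            g for g, s in remaining.items()
            if not any(dominates(t, s) for t in remaining.values())
        )
        layers.append(front)
        for g in front:
            del remaining[g]
    return layers

# Invented scores: (absolute log fold change, -log10 p-value).
scores = {
    "g1": (3.0, 4.0),
    "g2": (2.0, 5.0),
    "g3": (1.0, 1.0),
    "g4": (2.5, 3.0),
    "g5": (0.5, 0.5),
}
layers = pareto_layers(scores)
```

Here `g1` and `g2` share the first layer because each is better than the other on one criterion, which is how incompatible fold-change and p-value rankings are reconciled into a single preference list.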
We report on the original integration of an automatic text categorization pipeline, so-called ToxiCat (Toxicogenomic Categorizer), that we developed to perform biomedical documents classification and prioritization in order to speed up the curation of the Comparative Toxicogenomics Database (CTD). The task can be basically described as a binary classification task, where a scoring function is used to rank a selected set of articles. Then components of a question-answering system are used to extract CTD-specific annotations from the ranked list of articles. The ranking function is generated using a Support Vector Machine, which combines three main modules: an information retrieval engine for MEDLINE (EAGLi), a gene normalization service (NormaGene) developed for a previous BioCreative campaign and finally, a set of answering components and entity recognizer for diseases and chemicals. The main components of the pipeline are publicly available both as web application and web services. The specific integration performed for the BioCreative competition is available via a web user interface at http://pingu.unige.ch:8080/Toxicat.
To identify differentially expressed genes (DEGs) from microarray data, users of the Affymetrix GeneChip system need to select both a preprocessing algorithm to obtain expression-level measurements and a way of ranking genes to obtain the most plausible candidates. We recently recommended suitable combinations of a preprocessing algorithm and gene ranking method that can be used to identify DEGs with a higher level of sensitivity and specificity. However, in addition to these recommendations, researchers also want to know which combinations enhance reproducibility.
We compared eight conventional methods for ranking genes: weighted average difference (WAD), average difference (AD), fold change (FC), rank products (RP), moderated t statistic (modT), significance analysis of microarrays (samT), shrinkage t statistic (shrinkT), and intensity-based moderated t statistic (ibmT) with six preprocessing algorithms (PLIER, VSN, FARMS, multi-mgMOS (mmgMOS), MBEI, and GCRMA). A total of 36 real experimental datasets was evaluated on the basis of the area under the receiver operating characteristic curve (AUC) as a measure for both sensitivity and specificity. We found that the RP method performed well for VSN-, FARMS-, MBEI-, and GCRMA-preprocessed data, and the WAD method performed well for mmgMOS-preprocessed data. Our analysis of the MicroArray Quality Control (MAQC) project's datasets showed that the FC-based gene ranking methods (WAD, AD, FC, and RP) had a higher level of reproducibility: The percentages of overlapping genes (POGs) across different sites for the FC-based methods were higher overall than those for the t-statistic-based methods (modT, samT, shrinkT, and ibmT). In particular, POG values for WAD were the highest overall among the FC-based methods irrespective of the choice of preprocessing algorithm.
Our results demonstrate that to increase sensitivity, specificity, and reproducibility in microarray analyses, we need to select suitable combinations of preprocessing algorithms and gene ranking methods. We recommend the use of FC-based methods, in particular RP or WAD.
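The percentage of overlapping genes (POG) used as the reproducibility measure can be sketched in a few lines. The abstract does not spell out the exact definition, so this sketch assumes POG is the share of genes common to the top-n entries of two ranked lists; the function name `pog` and the example lists are hypothetical.

```python
def pog(list_a, list_b, top_n):
    """Percentage of genes shared by the top_n entries of two ranked lists (assumed definition)."""
    top_a, top_b = set(list_a[:top_n]), set(list_b[:top_n])
    return 100.0 * len(top_a & top_b) / top_n

# Invented rankings of the same five genes at two test sites.
site1 = ["g1", "g2", "g3", "g4", "g5"]
site2 = ["g2", "g1", "g5", "g3", "g4"]

pog_top2 = pog(site1, site2, 2)
pog_top3 = pog(site1, site2, 3)
```

Because POG depends on the chosen list length, it is usually worth reporting it for several values of `top_n` rather than a single cutoff.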
Comparing independent high-throughput gene-expression experiments can generate hypotheses about which gene-expression programs are shared between particular biological processes. Current techniques to compare expression profiles typically involve choosing a fixed differential expression threshold to summarize results, potentially reducing sensitivity to small but concordant changes. We present a threshold-free algorithm called Rank–rank Hypergeometric Overlap (RRHO). This algorithm steps through two gene lists ranked by the degree of differential expression observed in two profiling experiments, successively measuring the statistical significance of the number of overlapping genes. The output is a graphical map that shows the strength, pattern and bounds of correlation between two expression profiles. To demonstrate RRHO sensitivity and dynamic range, we identified shared expression networks in cancer microarray profiles driving tumor progression, stem cell properties and response to targeted kinase inhibition. We demonstrate how RRHO can be used to determine which model system or drug treatment best reflects a particular biological or disease response. The threshold-free and graphical aspects of RRHO complement other rank-based approaches such as Gene Set Enrichment Analysis (GSEA), for which RRHO is a 2D analog. Rank–rank overlap analysis is a sensitive, robust and web-accessible method for detecting and visualizing overlap trends between two complete, continuous gene-expression profiles. A web-based implementation of RRHO can be accessed at http://systems.crump.ucla.edu/rankrank/.
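The stepping procedure described above can be sketched as follows: for every pair of cut points in the two ranked lists, the overlap of the two top sets is counted and scored with a one-sided hypergeometric test. This is a simplified illustration rather than the published RRHO implementation; the function names, the `step` parameter, the restriction to over-enrichment, and the absence of any multiple-testing correction are all assumptions made to keep the example short.

```python
from math import comb, log10

def hypergeom_sf(k, N, K, n):
    """P(X >= k) for X ~ Hypergeometric(population N, K marked items, n draws)."""
    total = comb(N, n)
    upper = min(K, n)
    return sum(comb(K, x) * comb(N - K, n - x) for x in range(k, upper + 1)) / total

def rrho_map(list1, list2, step=1):
    """Grid of -log10 p-values; rows index cut points in list1, columns in list2."""
    if set(list1) != set(list2):
        raise ValueError("both lists must rank the same genes")
    N = len(list1)
    cuts = range(step, N + 1, step)
    grid = []
    for i in cuts:
        top1 = set(list1[:i])
        row = []
        for j in cuts:
            overlap = len(top1 & set(list2[:j]))
            # Significance of seeing at least this many shared genes by chance.
            p = hypergeom_sf(overlap, N, i, j)
            row.append(-log10(p))
        grid.append(row)
    return grid

genes = ["g1", "g2", "g3", "g4", "g5", "g6"]
same = rrho_map(genes, genes)                   # perfectly concordant rankings
opposite = rrho_map(genes, list(reversed(genes)))  # perfectly discordant rankings
```

Plotting such a grid as a heat map gives the graphical view of where along the two rankings the overlap is strongest.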
To contribute to a further insight into heterosis we applied an integrative analysis to a systems biological network approach and a quantitative genetics analysis towards biomass heterosis in early Arabidopsis thaliana development. The study was performed on the parental accessions C24 and Col-0 and the reciprocal crosses. In an over-representation analysis it was tested if the overlap between the resulting gene lists of the two approaches is significantly larger than expected by chance. Top ranked genes in the results list of the systems biological analysis were significantly over-represented in the heterotic QTL candidate regions for either hybrid as well as regarding mid-parent and best-parent heterosis. This suggests that not only a few but rather several genes that influence biomass heterosis are located within each heterotic QTL region. Furthermore, the overlapping resulting genes of the two integrated approaches were particularly enriched in biomass related pathways. A chromosome-wise over-representation analysis gave rise to the hypothesis that chromosomes number 2 and 4 probably carry a majority of the genes involved in biomass heterosis in the early development of Arabidopsis thaliana.
The pathophysiological mechanisms underlying the development of obesity and metabolic diseases are not well understood. To gain more insight into the genetic mediators associated with the onset and progression of diet-induced obesity and metabolic diseases, we studied the molecular changes in response to a high-fat diet (HFD) by using a mode-of-action by network identification (MNI) analysis. Oligo DNA microarray analysis was performed on visceral and subcutaneous adipose tissues and muscles of male C57BL/6N mice fed a normal diet or HFD for 2, 4, 8, and 12 weeks. Each of these data was queried against the MNI algorithm, and the lists of top 5 highly ranked genes and gene ontology (GO)-annotated pathways that were significantly overrepresented among the 100 highest ranked genes at each time point in the 3 different tissues of mice fed the HFD were considered in the present study. The 40 highest ranked genes identified by MNI analysis at each time point in the different tissues of mice with diet-induced obesity were subjected to clustering based on their temporal patterns. On the basis of the above-mentioned results, we investigated the sequential induction of distinct olfactory receptors and the stimulation of cancer-related genes during the development of obesity in both adipose tissues and muscles. The top 5 genes recognized using the MNI analysis at each time point and gene cluster identified based on their temporal patterns in the peripheral tissues of mice provided novel and often surprising insights into the potential genetic mediators for obesity progression.
Reproducibility of results can have a significant impact on the acceptance of new technologies in gene expression analysis. With the recent introduction of the so-called next-generation sequencing (NGS) technology and established microarrays, one is able to choose between two completely different platforms for gene expression measurements. This study introduces a novel methodology for gene-ranking stability analysis that is applied to the evaluation of gene-ranking reproducibility on NGS and microarray data.
The same data used in a well-known MicroArray Quality Control (MAQC) study was also used in this study to compare ranked lists of genes from MAQC samples A and B, obtained from Affymetrix HG-U133 Plus 2.0 and Roche 454 Genome Sequencer FLX platforms. An initial evaluation, where the percentage of overlapping genes was observed, demonstrates higher reproducibility on microarray data in 10 out of 11 gene-ranking methods. A gene set enrichment analysis shows similar enrichment of top gene sets when NGS is compared with microarrays on a pathway level. Our novel approach demonstrates high accuracy of decision trees when used for knowledge extraction from multiple bootstrapped gene set enrichment analysis runs. A comparison of the two approaches in sample preparation for high-throughput sequencing shows that alternating decision trees represent the optimal knowledge representation method in comparison with classical decision trees.
Usual reproducibility measurements are mostly based on statistical techniques that offer very limited biological insights into the studied gene expression data sets. This paper introduces the meta-learning-based gene set enrichment analysis that can be used to complement the analysis of gene-ranking stability estimation techniques such as percentage of overlapping genes or classic gene set enrichment analysis. It is useful and practical when reproducibility of gene ranking results or different gene selection techniques is observed. The proposed method reveals very accurate descriptive models that capture the co-enrichment of gene sets which are differently enriched in the compared data sets.
In recent years, although many ligand-binding site prediction methods have been developed, there has still been a great demand to improve the prediction accuracy and compare different prediction algorithms to evaluate their performances. In this work, in order to improve the performance of the protein-ligand binding site prediction method presented in our former study, a comparison of different binding site ranking lists was studied. Four kinds of properties, i.e., pocket size, distance from the protein centroid, sequence conservation and the number of hydrophobic residues, have been chosen as the corresponding ranking criterion respectively. Our studies show that the sequence conservation information helps to rank the real pockets with the most successful accuracy compared to others. At the same time, the pocket size and the distance of binding site from the protein centroid are also found to be helpful. In addition, a multi-view ranking aggregation method, which combines the information among those four properties, was further applied in our study. The results show that a better performance can be achieved by the aggregation of the complementary properties in the prediction of ligand-binding sites.
ranking aggregation; protein-ligand binding site; prediction
The transcriptional regulator AgrA, a member of the LytTR family of proteins, plays a key role in controlling gene expression in some Gram-positive pathogens, including Staphylococcus aureus and Enterococcus faecalis. AgrA is encoded by the agrACDB global regulatory locus, and orthologues are found within the genome of most Clostridium difficile isolates, including the epidemic lineage 027/BI/NAP1. Comparative RNA sequencing of the wild type and otherwise isogenic agrA null mutant derivatives of C. difficile R20291 revealed a network of approximately 75 differentially regulated transcripts at late exponential growth phase, including many genes associated with flagellar assembly and function, such as the major structural subunit, FliC. Other differentially regulated genes include several involved in bis-(3′-5′)-cyclic dimeric GMP (c-di-GMP) synthesis and toxin A expression. C. difficile 027 R20291 agrA mutant derivatives were poorly flagellated and exhibited reduced levels of colonization and relapses in the murine infection model. Thus, the agr locus likely plays a contributory role in the fitness and virulence potential of C. difficile strains in the 027/BI/NAP1 lineage.
We consider the problem of finding the set of rankings that best represents a given group of orderings on the same collection of elements (preference lists). This problem arises from social choice and voting theory, in which each voter gives a preference on a set of alternatives, and a system outputs a single preference order based on the observed voters’ preferences. In this paper, we observe that, if the given set of preference lists is not homogeneous, a unique true underlying ranking might not exist. Moreover, only the lists that share the highest amount of information should be aggregated, and thus multiple rankings might provide a more feasible solution to the problem. In this light, we propose Network Selection, an algorithm that, given a heterogeneous group of rankings, first discovers the different communities of homogeneous rankings and then combines only the rank orderings belonging to the same community into a single final ordering. Our novel approach is inspired by graph theory; indeed our set of lists can be loosely read as the nodes of a network. As a consequence, only the lists populating the same community in the network would then be aggregated. In order to highlight the strength of our proposal, we show an application both on simulated and on two real datasets, namely a financial and a biological dataset. Experimental results on simulated data show that Network Selection can significantly outperform existing related methods. In addition, the empirical evidence obtained on real financial data reveals that Network Selection is also able to select the most relevant variables in data mining predictive models, yielding clearly superior predictive power in the models built. Furthermore, we show the potential of our proposal in the bioinformatics field, providing an application to a biological microarray dataset.
Listeria monocytogenes is a ubiquitous, opportunistic pathogenic organism. Environmental adaptation requires constant regulation of gene expression. Among transcriptional regulators, AgrA is part of an auto-induction system. Temperature is an environmental cue critical for in vivo adaptation. In order to investigate how temperature may affect AgrA-dependent transcription, we compared the transcriptomes of the parental strain L. monocytogenes EGD-e and its ΔagrA mutant at the saprophytic temperature of 25°C and in vivo temperature of 37°C. Variations of transcriptome were higher at 37°C than at 25°C. Results suggested that AgrA may be involved in the regulation of nitrogen transport, amino acids, purine and pyrimidine biosynthetic pathways and phage-related functions. Deregulations resulted in a growth advantage at 37°C, but affected salt tolerance. Finally, our results suggest overlaps with PrfA, σB, σH and CodY regulons. These overlaps may suggest that through AgrA, Listeria monocytogenes integrates information on its biotic environment.
Large-scale genomic studies often identify large gene lists, for example, sets of genes that share the same expression pattern. These gene lists are generally interpreted by extracting concepts that are overrepresented in them. This analysis often depends on manual annotation of genes based on controlled vocabularies, in particular Gene Ontology (GO). However, annotating genes is a labor-intensive process, and the vocabularies are generally incomplete, leaving some important biological domains inadequately covered.
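For orientation, annotation-based overrepresentation analysis of the kind described above is commonly carried out with a hypergeometric (one-sided Fisher) test. The sketch below illustrates that conventional baseline only; the function names, the toy annotation table, and the gene identifiers are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, population, annotated, drawn):
    """P(X >= k) when `drawn` genes are sampled without replacement from
    `population` genes, of which `annotated` carry the term."""
    upper = min(annotated, drawn)
    total = comb(population, drawn)
    favourable = sum(
        comb(annotated, i) * comb(population - annotated, drawn - i)
        for i in range(k, upper + 1)
    )
    return favourable / total


def overrepresented_terms(gene_list, annotations, background):
    """Return (term, p-value) pairs sorted by p-value.

    annotations: dict mapping a term to the set of background genes carrying it
    (a hypothetical, hand-made stand-in for a GO annotation table).
    """
    genes = set(gene_list) & set(background)
    results = []
    for term, members in annotations.items():
        members = set(members) & set(background)
        hits = len(genes & members)
        p = hypergeom_upper_tail(hits, len(background), len(members), len(genes))
        results.append((term, p))
    return sorted(results, key=lambda pair: pair[1])


if __name__ == "__main__":
    background = [f"g{i}" for i in range(10)]
    annotations = {
        "term_A": {"g0", "g1", "g2", "g3", "g4"},
        "term_B": {"g5", "g6"},
    }
    print(overrepresented_terms(["g0", "g1", "g2", "g3", "g4"], annotations, background))
```

A real analysis would also correct the resulting p-values for multiple testing, which is omitted here.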
We propose a statistical method that uses the primary literature, i.e. free text, as the source for overrepresentation analysis. The method is based on a mixture-model statistical framework and addresses methodological flaws in several existing programs. We implemented this method within a literature mining system, BeeSpace, taking advantage of its analysis environment, and added features that facilitate the interactive analysis of gene sets. Through experimentation with several datasets, we showed that our program can effectively summarize the important conceptual themes of large gene sets, even when traditional GO-based analysis does not yield informative results.
We conclude that the current work will provide biologists with a tool that effectively complements the existing ones for overrepresentation analysis from genomic experiments. Our program, Genelist Analyzer, is freely available at: http://workerbee.igb.uiuc.edu:8080/BeeSpace/Search.jsp
The accessory gene regulator (agr) locus, a candidate system for the regulation of the production of virulence factors in Staphylococcus intermedius, has been characterized. Using PCR-based genome walking, we have obtained the first complete sequence (3,436 bp) of the accessory gene regulator (agr) gene in this organism. Sequence analysis of the agr gene has identified five open reading frames (ORFs), agrB, agrD, agrC, agrA, and hld. The translated ORF contained amino acid motifs characteristic of the response regulator and histidine protein kinase signal transducer of the classic two-component regulatory system. Sequencing of the agrD PCR products amplified from DNA from 20 different isolates has facilitated detection of genetic variation in the putative autoinducing peptide (AIP) within the agr gene of S. intermedius, revealing the presence of at least three agr specificity groups within this species. Classification of the agr gene from S. intermedius was supported by phylogenetic analysis. Real-time PCR also revealed that the effector molecule of the agr system, RNAIII, was regulated in an autocrine manner in S. intermedius and demonstrated positive correlation with the temporal gene expression patterns of luk and entC. Transcription of RNAIII was also dependent on self secreted cues. Cyclic self and nonself peptides were synthesized on the basis of the novel AIPs produced by S. intermedius, which lack the cysteine necessary to form the thiolactone ring in analogous peptides from Staphylococcus aureus and Staphylococcus epidermidis. Experiments with these synthetic cyclic peptides indicated that self peptides led to up-regulation of RNAIII—findings in support of the assumption that activation of the agr gene is initiated by growth- and species-specific factors generated during bacterial growth.
Text mining tools have gained popularity for processing the vast number of research articles available in the biomedical literature. It is crucial that such tools extract information with a sufficient level of detail to be applicable in real-life scenarios. Studies of mining non-causal molecular relations contribute to this goal by formally identifying the relations between genes, promoters, complexes and various other molecular entities found in text. More importantly, these studies help to enhance the integration of text mining results with database facts.
We describe, compare and evaluate two frameworks developed for the prediction of non-causal or 'entity' relations (REL) between gene symbols and domain terms. For the corresponding REL challenge of the BioNLP Shared Task of 2011, these systems ranked first (57.7% F-score) and second (41.6% F-score). In this paper, we investigate the performance discrepancy of 16 percentage points by benchmarking on a related and more extensive dataset, analysing the contribution of both the term detection and relation extraction modules. We further construct a hybrid system combining the two frameworks and experiment with intersection and union combinations, achieving high-precision and high-recall results, respectively. Finally, we highlight extremely high-performance results (F-score >90%) obtained for the specific subclass of embedded entity relations that are essential for integrating text mining predictions with database facts.
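The trade-off between the intersection and union combinations can be seen in a small sketch: intersecting two systems' predictions keeps only relations both agree on (favouring precision), while taking the union keeps everything either system found (favouring recall). The relation tuples, the gold set, and the function names below are invented for illustration and do not come from the shared task data.

```python
def precision_recall_f(predicted, gold):
    """Return (precision, recall, F-score) for a set of predicted relations."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


def combine(system_a, system_b, mode):
    """Hybrid prediction set: 'intersection' or 'union' of the two systems."""
    if mode == "intersection":
        return system_a & system_b
    if mode == "union":
        return system_a | system_b
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    # Hypothetical (entity, relation, term) triples.
    gold = {("g1", "part-of", "t1"), ("g2", "part-of", "t2"),
            ("g3", "member-of", "t3"), ("g4", "member-of", "t4")}
    system_a = {("g1", "part-of", "t1"), ("g2", "part-of", "t2"),
                ("g3", "member-of", "t3"), ("g9", "part-of", "t9")}
    system_b = {("g2", "part-of", "t2"), ("g3", "member-of", "t3"),
                ("g4", "member-of", "t4"), ("g8", "part-of", "t8")}
    for mode in ("intersection", "union"):
        print(mode, precision_recall_f(combine(system_a, system_b, mode), gold))
```

In this toy case the intersection reaches perfect precision at half the recall, whereas the union recovers every gold relation at the cost of including both systems' false positives.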
The results from this study will enable us in the near future to annotate semantic relations between molecular entities in the entire scientific literature available through PubMed. The recent release of the EVEX dataset, containing biomolecular event predictions for millions of PubMed articles, is an interesting and exciting opportunity to overlay these entity relations with event predictions on a literature-wide scale.