As a first step in analyzing high-throughput data in genome-wide studies, several algorithms are available to identify and prioritize candidate lists for downstream fine-mapping. The prioritized candidates could be differentially expressed genes, aberrations in comparative genomic hybridization studies, or single nucleotide polymorphisms (SNPs) in association studies. Different analysis algorithms are subject to various experimental artifacts and analytical features that lead to different candidate lists. However, little research has been carried out to theoretically quantify the consensus between different candidate lists and to compare the study-specific accuracy of the analytical methods based on a known reference candidate list. Within the context of genome-wide studies, we propose a generic mathematical framework to statistically compare ranked lists of candidates from different algorithms with each other or, if available, with a reference candidate list. To cope with the growing need for intuitive visualization of high-throughput data in genome-wide studies, we describe a complementary customizable visualization tool. As a case study, we demonstrate application of our framework to the comparison and visualization of candidate lists generated in a DNA-pooling-based genome-wide association study of CEPH data in the HapMap project, where prior knowledge from individual genotyping can be used to generate a true reference candidate list. The results provide a theoretical basis to compare the accuracy of various methods and to identify redundant methods, thus providing guidance for selecting the most suitable analysis method in genome-wide studies.
genome-wide association studies; candidate lists
A major problem faced in biomedical informatics involves how best to present information retrieval results. When a single query retrieves many results, simply showing them as a long list often provides poor overview. With a goal of presenting users with reduced sets of relevant citations, this study developed an approach that retrieved and organized MEDLINE citations into different topical groups and prioritized important citations in each group.
A text mining system framework for automatic document clustering and ranking organized MEDLINE citations following simple PubMed queries. The system grouped the retrieved citations, ranked the citations in each cluster, and generated a set of keywords and MeSH terms to describe the common theme of each cluster.
Several possible ranking functions were compared, including citation count per year (CCPY), citation count (CC), and journal impact factor (JIF). We evaluated this framework by identifying as “important” those articles selected by the Surgical Oncology Society.
Our results showed that CCPY outperforms CC and JIF, i.e., CCPY better ranked important articles than did the others. Furthermore, our text clustering and knowledge extraction strategy grouped the retrieval results into informative clusters as revealed by the keywords and MeSH terms extracted from the documents in each cluster.
The text mining system studied effectively integrated text clustering, text summarization, and text ranking and organized MEDLINE retrieval results into different topical groups.
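The difference between ranking by citation count (CC) and by citation count per year (CCPY) can be illustrated with a minimal sketch. It assumes CCPY is the citation count divided by the number of years since publication (floored at one year); the `Citation` record, its field names, and the example values are hypothetical and not taken from the study.

```python
from dataclasses import dataclass


@dataclass
class Citation:
    pmid: str            # hypothetical identifier
    year: int            # publication year
    citation_count: int  # total citations received so far


def ccpy(c: Citation, current_year: int) -> float:
    """Citation count per year; age is floored at 1 to avoid division by zero (assumption)."""
    years = max(current_year - c.year, 1)
    return c.citation_count / years


def rank_by_cc(citations):
    """Rank by raw citation count, highest first."""
    return sorted(citations, key=lambda c: c.citation_count, reverse=True)


def rank_by_ccpy(citations, current_year: int):
    """Rank by citation count per year, highest first."""
    return sorted(citations, key=lambda c: ccpy(c, current_year), reverse=True)


old = Citation("A", 2000, 100)    # 100 citations over 10 years
recent = Citation("B", 2008, 60)  # 60 citations over 2 years
by_cc = [c.pmid for c in rank_by_cc([old, recent])]
by_ccpy = [c.pmid for c in rank_by_ccpy([old, recent], current_year=2010)]
```

In this toy case the older article leads under CC, while the more recent, faster-cited article leads under CCPY.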
Researchers in the field of bioinformatics often face a challenge of combining several ordered lists in a proper and efficient manner. Rank aggregation techniques offer a general and flexible framework that allows one to objectively perform the necessary aggregation. With the rapid growth of high-throughput genomic and proteomic studies, the potential utility of rank aggregation in the context of meta-analysis becomes even more apparent. One of the major strengths of rank-based aggregation is the ability to combine lists coming from different sources and platforms, for example different microarray chips, which may or may not be directly comparable otherwise.
The RankAggreg package provides two methods for combining the ordered lists: the Cross-Entropy method and the Genetic Algorithm. Two examples of rank aggregation using the package are given in the manuscript: one in the context of clustering based on gene expression, and the other one in the context of meta-analysis of prostate cancer microarray experiments.
The two examples described in the manuscript clearly show the utility of the RankAggreg package in the current bioinformatics context where ordered lists are routinely produced as a result of modern high-throughput technologies.
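The optimization problem behind rank aggregation can be sketched as follows. This is not the RankAggreg implementation: it assumes the objective is the summed Spearman footrule distance between a candidate ordering and each input list, and it replaces the Cross-Entropy and Genetic Algorithm searches with an exhaustive search that is only feasible for a handful of items. All function names are illustrative.

```python
from itertools import permutations


def footrule(a, b):
    """Spearman footrule distance: sum of absolute position differences."""
    pos_b = {item: i for i, item in enumerate(b)}
    return sum(abs(i - pos_b[item]) for i, item in enumerate(a))


def aggregate_brute_force(lists):
    """Return the ordering minimizing the total footrule distance to all lists.

    Assumes every list ranks the same items; tries all n! orderings.
    """
    items = lists[0]
    best = min(
        permutations(items),
        key=lambda cand: sum(footrule(cand, lst) for lst in lists),
    )
    return list(best)


ranked_lists = [
    ["a", "b", "c"],
    ["a", "c", "b"],
    ["a", "b", "c"],
]
consensus = aggregate_brute_force(ranked_lists)
```

Because the search space grows factorially with the number of items, stochastic searches such as the two named above become necessary for realistic list lengths.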
Ranked gene lists from microarray experiments are usually analysed by assigning significance to predefined gene categories, e.g., based on functional annotations. Tools performing such analyses are often restricted to a category score based on a cutoff in the ranked list and a significance calculation based on random gene permutations as null hypothesis.
We analysed three publicly available data sets, in each of which samples were divided into two classes and genes ranked according to their correlation to class labels. We developed a program, Catmap (available for download at ), to compare different scores and null hypotheses in gene category analysis, using Gene Ontology annotations for category definition. When a cutoff-based score was used, results depended strongly on the choice of cutoff, introducing an arbitrariness in the analysis. Comparing results using random gene permutations and random sample permutations, respectively, we found that the assigned significance of a category depended strongly on the choice of null hypothesis. Compared to sample label permutations, gene permutations gave much smaller p-values for large categories with many coexpressed genes.
In gene category analyses of ranked gene lists, a cutoff-independent score is preferable. The choice of null hypothesis is very important; random gene permutations do not work well as an approximation to sample label permutations.
The outcome of a functional genomics pipeline is usually a partial list of genomic features, ranked by their relevance in modelling biological phenotype in terms of a classification or regression model. Due to resampling protocols or to a meta-analysis comparison, it is often the case that sets of alternative feature lists (possibly of different lengths) are obtained, instead of just one list. Here we introduce a method, based on permutations, for studying the variability between lists (“list stability”) in the case of lists of unequal length. We provide algorithms evaluating stability for lists embedded in the full feature set or just limited to the features occurring in the partial lists. The method is demonstrated by finding and comparing gene profiles on a large prostate cancer dataset, consisting of two cohorts of patients from different countries, for a total of 455 samples.
We report on the original integration of an automatic text categorization pipeline, so-called ToxiCat (Toxicogenomic Categorizer), that we developed to perform biomedical documents classification and prioritization in order to speed up the curation of the Comparative Toxicogenomics Database (CTD). The task can be basically described as a binary classification task, where a scoring function is used to rank a selected set of articles. Then components of a question-answering system are used to extract CTD-specific annotations from the ranked list of articles. The ranking function is generated using a Support Vector Machine, which combines three main modules: an information retrieval engine for MEDLINE (EAGLi), a gene normalization service (NormaGene) developed for a previous BioCreative campaign and finally, a set of answering components and entity recognizer for diseases and chemicals. The main components of the pipeline are publicly available both as web application and web services. The specific integration performed for the BioCreative competition is available via a web user interface at http://pingu.unige.ch:8080/Toxicat.
To identify differentially expressed genes (DEGs) from microarray data, users of the Affymetrix GeneChip system need to select both a preprocessing algorithm to obtain expression-level measurements and a way of ranking genes to obtain the most plausible candidates. We recently recommended suitable combinations of a preprocessing algorithm and gene ranking method that can be used to identify DEGs with a higher level of sensitivity and specificity. However, in addition to these recommendations, researchers also want to know which combinations enhance reproducibility.
We compared eight conventional methods for ranking genes: weighted average difference (WAD), average difference (AD), fold change (FC), rank products (RP), moderated t statistic (modT), significance analysis of microarrays (samT), shrinkage t statistic (shrinkT), and intensity-based moderated t statistic (ibmT) with six preprocessing algorithms (PLIER, VSN, FARMS, multi-mgMOS (mmgMOS), MBEI, and GCRMA). A total of 36 real experimental datasets was evaluated on the basis of the area under the receiver operating characteristic curve (AUC) as a measure for both sensitivity and specificity. We found that the RP method performed well for VSN-, FARMS-, MBEI-, and GCRMA-preprocessed data, and the WAD method performed well for mmgMOS-preprocessed data. Our analysis of the MicroArray Quality Control (MAQC) project's datasets showed that the FC-based gene ranking methods (WAD, AD, FC, and RP) had a higher level of reproducibility: The percentages of overlapping genes (POGs) across different sites for the FC-based methods were higher overall than those for the t-statistic-based methods (modT, samT, shrinkT, and ibmT). In particular, POG values for WAD were the highest overall among the FC-based methods irrespective of the choice of preprocessing algorithm.
Our results demonstrate that to increase sensitivity, specificity, and reproducibility in microarray analyses, we need to select suitable combinations of preprocessing algorithms and gene ranking methods. We recommend the use of FC-based methods, in particular RP or WAD.
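The reproducibility measure used above, the percentage of overlapping genes (POG), can be sketched as the share of genes common to the top-n positions of two ranked lists. The exact definition used in the study may differ in detail (for example in how direction of change is handled), so this is an assumption; gene names are placeholders.

```python
def pog(list_a, list_b, n):
    """Percentage of genes shared by the top-n entries of two ranked lists."""
    top_a, top_b = set(list_a[:n]), set(list_b[:n])
    return 100.0 * len(top_a & top_b) / n


site_1 = ["g1", "g2", "g3", "g4"]
site_2 = ["g2", "g1", "g5", "g3"]
pog_top2 = pog(site_1, site_2, 2)  # both top-2 sets are {g1, g2}
pog_top3 = pog(site_1, site_2, 3)  # two of three genes shared
```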
The first objective of a DNA microarray experiment is typically to generate a list of genes or probes that are found to be differentially expressed or represented (in the case of comparative genomic hybridizations and/or copy number variation) between two conditions or strains. Rank Products analysis comprises a robust algorithm for deriving such lists from microarray experiments that comprise small numbers of replicates, for example, fewer than the number required for the commonly used t-test. Until now, users wishing to apply Rank Products analysis to their own microarray data sets have been restricted to command line-based software, which can limit its usage within the biological community.
Here we have developed a web interface to existing Rank Products analysis tools allowing users to quickly process their data in an intuitive and step-wise manner to obtain the respective Rank Product or Rank Sum, probability of false prediction and p-values in a downloadable file.
The online interactive Rank Products analysis tool RankProdIt, for analysis of any data set containing measurements for multiple replicated conditions, is available at: http://strep-microarray.sbs.surrey.ac.uk/RankProducts
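For readers unfamiliar with the statistic, a minimal sketch of the rank product itself follows. It assumes the common formulation in which genes are ranked within each replicate by fold change (rank 1 for the largest) and the rank product is the geometric mean of those ranks. It does not compute the probability of false prediction or p-values that the tool reports, and the input layout is hypothetical.

```python
import math


def rank_products(fold_changes):
    """Geometric mean of per-replicate ranks.

    fold_changes: dict mapping gene -> list of log fold changes,
    one value per replicate (all lists the same length).
    """
    genes = list(fold_changes)
    k = len(next(iter(fold_changes.values())))
    ranks = {g: [] for g in genes}
    for r in range(k):
        # Rank 1 goes to the most up-regulated gene in this replicate.
        ordered = sorted(genes, key=lambda g: fold_changes[g][r], reverse=True)
        for rank, g in enumerate(ordered, start=1):
            ranks[g].append(rank)
    return {g: math.prod(rs) ** (1.0 / k) for g, rs in ranks.items()}


rp = rank_products({
    "A": [2.0, 1.5],
    "B": [1.0, 1.8],
    "C": [0.1, 0.2],
})
```

Genes with small rank products are consistently near the top across replicates; in the toy input, A and B tie and C is last in both replicates.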
To explore the automated acquisition of knowledge in biomedical and clinical documents using text mining and statistical techniques to identify disease-drug associations.
Biomedical literature and clinical narratives from the patient record were mined to gather knowledge about disease-drug associations. Two NLP systems, BioMedLEE and MedLEE, were applied to Medline articles and discharge summaries, respectively. Disease and drug entities were identified using the NLP systems in addition to MeSH annotations for the Medline articles. Focusing on eight diseases, co-occurrence statistics were applied to compute and evaluate the strength of association between each disease and relevant drugs.
Ranked lists of disease-drug pairs were generated and cutoffs calculated for identifying stronger associations among these pairs for further analysis. Differences and similarities between the text sources (i.e., biomedical literature and patient record) and annotations (i.e., MeSH and NLP-extracted UMLS concepts) with regards to disease-drug knowledge were observed.
This paper presents a method for acquiring disease-specific knowledge and a feasibility study of the method. The method is based on applying a combination of NLP and statistical techniques to both biomedical and clinical documents. The approach enabled extraction of knowledge about the drugs clinicians are using for patients with specific diseases based on the patient record, while it also acquired knowledge of drugs frequently involved in controlled trials for those same diseases. In comparing the disease-drug associations, we found the results to be appropriate: the two text sources contained consistent as well as complementary knowledge, and manual review of the top five disease-drug associations by a medical expert supported their correctness across the diseases.
The transcriptional regulator AgrA, a member of the LytTR family of proteins, plays a key role in controlling gene expression in some Gram-positive pathogens, including Staphylococcus aureus and Enterococcus faecalis. AgrA is encoded by the agrACDB global regulatory locus, and orthologues are found within the genome of most Clostridium difficile isolates, including the epidemic lineage 027/BI/NAP1. Comparative RNA sequencing of the wild type and otherwise isogenic agrA null mutant derivatives of C. difficile R20291 revealed a network of approximately 75 differentially regulated transcripts at late exponential growth phase, including many genes associated with flagellar assembly and function, such as the major structural subunit, FliC. Other differentially regulated genes include several involved in bis-(3′-5′)-cyclic dimeric GMP (c-di-GMP) synthesis and toxin A expression. C. difficile 027 R20291 agrA mutant derivatives were poorly flagellated and exhibited reduced levels of colonization and relapses in the murine infection model. Thus, the agr locus likely plays a contributory role in the fitness and virulence potential of C. difficile strains in the 027/BI/NAP1 lineage.
In recent years, although many ligand-binding site prediction methods have been developed, there has still been a great demand to improve prediction accuracy and to compare different prediction algorithms in order to evaluate their performance. In this work, in order to improve the performance of the protein-ligand binding site prediction method presented in our former study, a comparison of different binding site ranking lists was studied. Four properties, i.e., pocket size, distance from the protein centroid, sequence conservation and the number of hydrophobic residues, were each used as a ranking criterion. Our studies show that sequence conservation information ranks the real pockets with the highest accuracy compared with the other properties. At the same time, the pocket size and the distance of the binding site from the protein centroid are also found to be helpful. In addition, a multi-view ranking aggregation method, which combines the information among those four properties, was further applied in our study. The results show that a better performance can be achieved by the aggregation of the complementary properties in the prediction of ligand-binding sites.
ranking aggregation; protein-ligand binding site; prediction
The pathophysiological mechanisms underlying the development of obesity and metabolic diseases are not well understood. To gain more insight into the genetic mediators associated with the onset and progression of diet-induced obesity and metabolic diseases, we studied the molecular changes in response to a high-fat diet (HFD) by using a mode-of-action by network identification (MNI) analysis. Oligo DNA microarray analysis was performed on visceral and subcutaneous adipose tissues and muscles of male C57BL/6N mice fed a normal diet or HFD for 2, 4, 8, and 12 weeks. Each of these data was queried against the MNI algorithm, and the lists of top 5 highly ranked genes and gene ontology (GO)-annotated pathways that were significantly overrepresented among the 100 highest ranked genes at each time point in the 3 different tissues of mice fed the HFD were considered in the present study. The 40 highest ranked genes identified by MNI analysis at each time point in the different tissues of mice with diet-induced obesity were subjected to clustering based on their temporal patterns. On the basis of the above-mentioned results, we investigated the sequential induction of distinct olfactory receptors and the stimulation of cancer-related genes during the development of obesity in both adipose tissues and muscles. The top 5 genes recognized using the MNI analysis at each time point and gene cluster identified based on their temporal patterns in the peripheral tissues of mice provided novel and often surprising insights into the potential genetic mediators for obesity progression.
We consider the problem of finding the set of rankings that best represents a given group of orderings on the same collection of elements (preference lists). This problem arises from social choice and voting theory, in which each voter gives a preference on a set of alternatives, and a system outputs a single preference order based on the observed voters’ preferences. In this paper, we observe that, if the given set of preference lists is not homogeneous, a unique true underlying ranking might not exist. Moreover, only the lists that share the highest amount of information should be aggregated, and thus multiple rankings might provide a more feasible solution to the problem. In this light, we propose Network Selection, an algorithm that, given a heterogeneous group of rankings, first discovers the different communities of homogeneous rankings and then combines only the rank orderings belonging to the same community into a single final ordering. Our novel approach is inspired by graph theory; indeed our set of lists can be loosely read as the nodes of a network. As a consequence, only the lists populating the same community in the network would then be aggregated. In order to highlight the strength of our proposal, we show an application both on simulated and on two real datasets, namely a financial and a biological dataset. Experimental results on simulated data show that Network Selection can significantly outperform existing related methods. In turn, the empirical evidence obtained on real financial data reveals that Network Selection is also able to select the most relevant variables in data mining predictive models, providing a clear superiority in terms of predictive power of the models built. Furthermore, we show the potential of our proposal in the bioinformatics field, providing an application to a biological microarray dataset.
Listeria monocytogenes is a ubiquitous, opportunistic pathogenic organism. Environmental adaptation requires constant regulation of gene expression. Among transcriptional regulators, AgrA is part of an auto-induction system. Temperature is an environmental cue critical for in vivo adaptation. In order to investigate how temperature may affect AgrA-dependent transcription, we compared the transcriptomes of the parental strain L. monocytogenes EGD-e and its ΔagrA mutant at the saprophytic temperature of 25°C and in vivo temperature of 37°C. Variations of transcriptome were higher at 37°C than at 25°C. Results suggested that AgrA may be involved in the regulation of nitrogen transport, amino acids, purine and pyrimidine biosynthetic pathways and phage-related functions. Deregulations resulted in a growth advantage at 37°C, but affected salt tolerance. Finally, our results suggest overlaps with PrfA, σB, σH and CodY regulons. These overlaps may suggest that through AgrA, Listeria monocytogenes integrates information on its biotic environment.
Using gene co-expression analysis, researchers were able to predict clusters of genes with consistent functions that are relevant to cancer development and prognosis. We applied a weighted gene co-expression network (WGCN) analysis algorithm on glioblastoma multiforme (GBM) data obtained from the TCGA project and predicted a set of gene co-expression networks which are related to GBM prognosis.
We modified the Quasi-Clique Merger algorithm (QCM algorithm) into edge-covering Quasi-Clique Merger algorithm (eQCM) for mining weighted sub-network in WGCN. Each sub-network is considered a set of features to separate patients into two groups using K-means algorithm. Survival times of the two groups are compared using log-rank test and Kaplan-Meier curves. Simulations using random sets of genes are carried out to determine the thresholds for log-rank test p-values for network selection. Sub-networks with p-values less than their corresponding thresholds were further merged into clusters based on overlap ratios (>50%). The functions for each cluster are analyzed using gene ontology enrichment analysis.
Using the eQCM algorithm, we identified 8,124 sub-networks in the WGCN, out of which 170 sub-networks show p-values less than their corresponding thresholds. They were then merged into 16 clusters.
We identified 16 gene clusters associated with GBM prognosis using the eQCM algorithm. Our results not only confirmed previous findings including the importance of cell cycle and immune response in GBM, but also suggested important epigenetic events in GBM development and prognosis.
Identification of differentially expressed genes from microarray datasets is one of the most important analyses for microarray data mining. Popular algorithms such as the t-test rank genes based on a single statistic. The false positive rate of these methods can be improved by considering other features of differentially expressed genes.
We proposed a pattern recognition strategy for identifying differentially expressed genes. Genes are mapped to a two-dimensional feature space composed of average difference of gene expression and average expression levels. A density-based pruning algorithm (DB Pruning) is developed to screen out potential differentially expressed genes usually located in the sparse boundary region. Biases of popular algorithms for identifying differentially expressed genes are visually characterized. Experiments on 17 datasets from the Gene Expression Omnibus (GEO) database with experimentally verified differentially expressed genes showed that DB pruning can significantly improve the prediction accuracy of popular identification algorithms such as t-test, rank product, and fold change.
Density based pruning of non-differentially expressed genes is an effective method for enhancing statistical testing based algorithms for identifying differentially expressed genes. It improves t-test, rank product, and fold change by 11% to 50% in the numbers of identified true differentially expressed genes. The source code of DB pruning is freely available on our website http://mleg.cse.sc.edu/degprune
Large-scale genomic studies often identify large gene lists, for example, the genes sharing the same expression patterns. The interpretation of these gene lists is generally achieved by extracting concepts overrepresented in the gene lists. This analysis often depends on manual annotation of genes based on controlled vocabularies, in particular, Gene Ontology (GO). However, the annotation of genes is a labor-intensive process; and the vocabularies are generally incomplete, leaving some important biological domains inadequately covered.
We propose a statistical method that uses the primary literature, i.e. free-text, as the source to perform overrepresentation analysis. The method is based on a statistical framework of mixture model and addresses the methodological flaws in several existing programs. We implemented this method within a literature mining system, BeeSpace, taking advantage of its analysis environment and added features that facilitate the interactive analysis of gene sets. Through experimentation with several datasets, we showed that our program can effectively summarize the important conceptual themes of large gene sets, even when traditional GO-based analysis does not yield informative results.
We conclude that the current work will provide biologists with a tool that effectively complements the existing ones for overrepresentation analysis from genomic experiments. Our program, Genelist Analyzer, is freely available at: http://workerbee.igb.uiuc.edu:8080/BeeSpace/Search.jsp
Since the inception of the GO annotation project, a variety of tools have been developed that support exploring and searching the GO database. In particular, a variety of tools that perform GO enrichment analysis are currently available. Most of these tools require as input a target set of genes and a background set and seek enrichment in the target set compared to the background set. A few tools also exist that support analyzing ranked lists. The latter typically rely on simulations or on union-bound correction for assigning statistical significance to the results.
GOrilla is a web-based application that identifies enriched GO terms in ranked lists of genes, without requiring the user to provide explicit target and background sets. This is particularly useful in many typical cases where genomic data may be naturally represented as a ranked list of genes (e.g. by level of expression or of differential expression). GOrilla employs a flexible threshold statistical approach to discover GO terms that are significantly enriched at the top of a ranked gene list. Building on a complete theoretical characterization of the underlying distribution, called mHG, GOrilla computes an exact p-value for the observed enrichment, taking threshold multiple testing into account without the need for simulations. This enables rigorous statistical analysis of thousands of genes and thousands of GO terms in the order of seconds. The output of the enrichment analysis is visualized as a hierarchical structure, providing a clear view of the relations between enriched GO terms.
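The flexible-threshold idea can be sketched with the minimum hypergeometric (mHG) statistic: for every cutoff n in the ranked list, compute the hypergeometric tail probability of seeing at least the observed number of annotated genes among the top n, and keep the smallest value. The sketch below computes only this raw statistic; it is not GOrilla's code and does not include the exact p-value correction for testing multiple thresholds. Function names are illustrative.

```python
from math import comb


def hg_tail(N, B, n, b):
    """P(X >= b) for X ~ Hypergeometric(population N, B successes, n draws)."""
    total = comb(N, n)
    hits = sum(comb(B, i) * comb(N - B, n - i) for i in range(b, min(n, B) + 1))
    return hits / total


def mhg_statistic(labels):
    """Minimum hypergeometric tail over all cutoffs 1 <= n < N.

    labels: list of 0/1 flags in rank order (1 = gene carries the GO term).
    Returns (statistic, cutoff at which the minimum is attained).
    """
    N, B = len(labels), sum(labels)
    best, best_n, hits = 1.0, 0, 0
    for n in range(1, N):
        hits += labels[n - 1]
        p = hg_tail(N, B, n, hits)
        if p < best:
            best, best_n = p, n
    return best, best_n


# Two annotated genes sitting at the very top of a six-gene ranked list.
stat, cutoff = mhg_statistic([1, 1, 0, 0, 0, 0])
```

The raw minimum is optimistically small because it is chosen over many cutoffs, which is why a correction such as the exact p-value described above is needed before interpreting it.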
GOrilla is an efficient GO analysis tool with unique features that make a useful addition to the existing repertoire of GO enrichment tools. GOrilla's unique features and advantages over other threshold free enrichment tools include rigorous statistics, fast running time and an effective graphical representation. GOrilla is publicly available at:
We report the Gene Normalization (GN) challenge in BioCreative III where participating teams were asked to return a ranked list of identifiers of the genes detected in full-text articles. For training, 32 fully and 500 partially annotated articles were prepared. A total of 507 articles were selected as the test set. Due to the high annotation cost, it was not feasible to obtain gold-standard human annotations for all test articles. Instead, we developed an Expectation Maximization (EM) algorithm approach for choosing a small number of test articles for manual annotation that were most capable of differentiating team performance. Moreover, the same algorithm was subsequently used for inferring ground truth based solely on team submissions. We report team performance on both gold standard and inferred ground truth using a newly proposed metric called Threshold Average Precision (TAP-k).
We received a total of 37 runs from 14 different teams for the task. When evaluated using the gold-standard annotations of the 50 articles, the highest TAP-k scores were 0.3297 (k=5), 0.3538 (k=10), and 0.3535 (k=20), respectively. Higher TAP-k scores of 0.4916 (k=5, 10, 20) were observed when evaluated using the inferred ground truth over the full test set. When combining team results using machine learning, the best composite system achieved TAP-k scores of 0.3707 (k=5), 0.4311 (k=10), and 0.4477 (k=20) on the gold standard, representing improvements of 12.4%, 21.8%, and 26.6% over the best team results, respectively.
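The reported improvements can be checked directly from the TAP-k scores quoted above, taking the improvement as the relative gain of the composite system over the best single team at each k. The variable names are illustrative; the numbers are those given in the text.

```python
best_team = {5: 0.3297, 10: 0.3538, 20: 0.3535}  # best team TAP-k on the gold standard
composite = {5: 0.3707, 10: 0.4311, 20: 0.4477}  # best composite system TAP-k


def relative_improvement_pct(new, old):
    """Relative gain of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old


improvements = {
    k: round(relative_improvement_pct(composite[k], best_team[k]), 1)
    for k in best_team
}
```

The resulting values agree with the 12.4%, 21.8%, and 26.6% stated in the abstract.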
By using full text and being species non-specific, the GN task in BioCreative III has moved closer to a real literature curation task than similar tasks in the past and presents additional challenges for the text mining community, as revealed in the overall team results. By evaluating teams using the gold standard, we show that the EM algorithm allows team submissions to be differentiated while keeping the manual annotation effort feasible. Using the inferred ground truth we show measures of comparative performance between teams. Finally, by comparing team rankings on gold standard vs. inferred ground truth, we further demonstrate that the inferred ground truth is as effective as the gold standard for detecting good team performance.
To contribute to a further insight into heterosis we applied an integrative analysis to a systems biological network approach and a quantitative genetics analysis towards biomass heterosis in early Arabidopsis thaliana development. The study was performed on the parental accessions C24 and Col-0 and the reciprocal crosses. In an over-representation analysis it was tested if the overlap between the resulting gene lists of the two approaches is significantly larger than expected by chance. Top ranked genes in the results list of the systems biological analysis were significantly over-represented in the heterotic QTL candidate regions for either hybrid as well as regarding mid-parent and best-parent heterosis. This suggests that not only a few but rather several genes that influence biomass heterosis are located within each heterotic QTL region. Furthermore, the overlapping resulting genes of the two integrated approaches were particularly enriched in biomass related pathways. A chromosome-wise over-representation analysis gave rise to the hypothesis that chromosomes number 2 and 4 probably carry a majority of the genes involved in biomass heterosis in the early development of Arabidopsis thaliana.
Comparing independent high-throughput gene-expression experiments can generate hypotheses about which gene-expression programs are shared between particular biological processes. Current techniques to compare expression profiles typically involve choosing a fixed differential expression threshold to summarize results, potentially reducing sensitivity to small but concordant changes. We present a threshold-free algorithm called Rank–rank Hypergeometric Overlap (RRHO). This algorithm steps through two gene lists ranked by the degree of differential expression observed in two profiling experiments, successively measuring the statistical significance of the number of overlapping genes. The output is a graphical map that shows the strength, pattern and bounds of correlation between two expression profiles. To demonstrate RRHO sensitivity and dynamic range, we identified shared expression networks in cancer microarray profiles driving tumor progression, stem cell properties and response to targeted kinase inhibition. We demonstrate how RRHO can be used to determine which model system or drug treatment best reflects a particular biological or disease response. The threshold-free and graphical aspects of RRHO complement other rank-based approaches such as Gene Set Enrichment Analysis (GSEA), for which RRHO is a 2D analog. Rank–rank overlap analysis is a sensitive, robust and web-accessible method for detecting and visualizing overlap trends between two complete, continuous gene-expression profiles. A web-based implementation of RRHO can be accessed at http://systems.crump.ucla.edu/rankrank/.
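The stepping procedure described above can be sketched as a grid of hypergeometric overlap tests. The sketch assumes both lists rank the same set of genes, tests only over-enrichment of the overlap, and reports -log10 of the uncorrected p-value at each pair of cutoffs; the published method also covers under-enrichment and multiple-testing correction, which are omitted here. Function names and the `step` parameter are illustrative.

```python
from math import comb, log10


def hypergeom_sf(N, K, n, k):
    """P(X >= k) when drawing n genes from N, of which K are 'marked'."""
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return tail / comb(N, n)


def rrho_map(list1, list2, step=1):
    """Matrix of -log10 p-values for the overlap of top-i and top-j genes."""
    N = len(list1)
    cuts = range(step, N, step)
    grid = []
    for i in cuts:
        top1 = set(list1[:i])
        row = []
        for j in cuts:
            k = len(top1 & set(list2[:j]))
            row.append(-log10(hypergeom_sf(N, i, j, k)))
        grid.append(row)
    return grid


genes = ["a", "b", "c", "d"]
same = rrho_map(genes, genes)              # identical rankings
opposite = rrho_map(genes, genes[::-1])    # reversed ranking
```

Concordant rankings give large values near the top-left corner of the map, while reversed rankings give values near zero there.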
A microarray study may select different differentially expressed gene sets because of different selection criteria. For example, the fold-change and p-value are two commonly known criteria to select differentially expressed genes under two experimental conditions. These two selection criteria often result in incompatible selected gene sets. Also, in a two-factor, say, treatment by time experiment, the investigator may be interested in one gene list that responds to both treatment and time effects.
We propose three layer ranking algorithms, point-admissible, line-admissible (convex), and Pareto, to provide a preference gene list from multiple gene lists generated by different ranking criteria. Using the public colon data as an example, the layer ranking algorithms are applied to the three univariate ranking criteria, fold-change, p-value, and frequency of selections by the SVM-RFE classifier. A simulation experiment shows that for experiments with small or moderate sample sizes (less than 20 per group) and detecting a 4-fold change or less, the two-dimensional (p-value and fold-change) convex layer ranking selects differentially expressed genes with generally lower FDR and higher power than the standard p-value ranking. Three applications are presented. The first application illustrates a use of the layer rankings to potentially improve predictive accuracy. The second application illustrates an application to a two-factor experiment involving two dose levels and two time points. The layer rankings are applied to selecting differentially expressed genes relating to the dose and time effects. In the third application, the layer rankings are applied to a benchmark data set consisting of three dilution concentrations to provide a ranking system from a long list of differentially expressed genes generated from the three dilution concentrations.
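As an illustration of the Pareto variant only, the following sketch peels genes into successive non-dominated layers. It rests on assumptions not stated in the abstract: every criterion is expressed so that larger is better (for example, absolute log fold-change and -log10 p-value), and a gene is dominated when another gene is at least as good on every criterion and strictly better on one. The names `dominates` and `pareto_layers` are hypothetical.

```python
def dominates(u, v):
    """True if score tuple u is at least as good as v everywhere and better somewhere."""
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))


def pareto_layers(scores):
    """Group genes into Pareto layers.

    scores: dict mapping gene name -> tuple of criteria (larger is better).
    Returns a list of layers; layer 0 holds the genes no other gene dominates,
    layer 1 the genes that become non-dominated once layer 0 is removed, etc.
    """
    remaining = dict(scores)
    layers = []
    while remaining:
        front = [
            g for g, s in remaining.items()
            if not any(dominates(t, s) for h, t in remaining.items() if h != g)
        ]
        layers.append(sorted(front))
        for g in front:
            del remaining[g]
    return layers
```

The layer index can then serve as a combined rank: genes in earlier layers are preferred, and genes within a layer are treated as ties under this sketch.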
The layer ranking algorithms are useful to help investigators in selecting the most promising genes from multiple gene lists generated by different filter, normalization, or analysis methods for various objectives.
Virulence gene expression in the human pathogen Staphylococcus aureus is partially controlled by the two-component quorum-sensing system encoded by genes of the agr locus. The product of the agrA gene has been identified by amino acid sequence similarity as the putative response regulator; however, binding of AgrA to promoters under its control has not yet been demonstrated. In this study, we isolated and purified soluble AgrA by expression under osmotic shock conditions and ion-exchange chromatography. Purified AgrA showed high-affinity binding to the RNAIII-agr intergenic region by electrophoretic mobility shift assays. Binding was localized by DNase I protection assays to a pair of direct repeats in the P2 and P3 promoter regions of the agr locus. We found that this binding was enhanced by the addition of the small phosphoryl donor, acetyl phosphate. The difference in binding affinity between these two promoters was found to result from a 2-bp difference between the downstream direct repeats of the P2 and P3 sites. Mutation of these base pairs in the P3 site to match those found in the P2 site increased the affinity of AgrA for the P3 site relative to that for the P2 site. These results are consistent with the function of AgrA as a response regulator with recognition sites in the promoter regions of RNAIII and the agr locus.
An agarase gene (agrA) was isolated by cloning genomic DNA prepared from Pseudomonas atlantica. The agarase activity in recombinant Escherichia coli was found in cell-free culture supernatants and could pass through a 0.45-μm-pore-size membrane separating cells from agar, suggesting that the gene product was exported in E. coli. The enzyme was specific for agar and agarose and did not digest alginate or carrageenan. Mutations generated by transposon mini-Mu d1(lacZ Kmr) were used to define the agrA coding region, as well as the direction of transcription of the gene. A procedure was developed to produce a P. atlantica agrA mutant. This required construction of an agrA::kan insertion mutation in vitro and subsequent introduction of the defect into the chromosome of P. atlantica by recombinational exchange. Transformation of P. atlantica with plasmids containing agrA::kan utilized a Tris-polyethylene glycol 6000-CaCl2 treatment for making competent cells. Replacement of wild-type agrA with agrA::kan resulted in loss of agarase activity. Uses of the agrA gene probe and an Agr− mutant for environmental studies are discussed.
Gene Set Enrichment Analysis (GSEA) is a computational method for the statistical evaluation of sorted lists of genes or proteins. GSEA was originally developed for interpreting microarray gene expression data, but it can be applied to any sorted list of genes. Given the gene list and an arbitrary biological category, GSEA evaluates whether the genes of the considered category are randomly distributed or accumulate at the top or bottom of the list. Usually, significance scores (p-values) of GSEA are computed by nonparametric permutation tests, a time-consuming procedure that yields only estimates of the p-values.
We present a novel dynamic programming algorithm for calculating exact significance values of unweighted Gene Set Enrichment Analyses. Our algorithm avoids typical problems of nonparametric permutation tests, such as findings that vary between runs because of the random sampling procedure. Another advantage of the presented dynamic programming algorithm is its runtime and memory efficiency. To test our algorithm, we applied it not only to simulated data sets but also to expression profiles of squamous cell lung cancer tissue and autologous unaffected tissue.
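To make the idea of an exact p-value concrete, here is a minimal sketch. It is not necessarily the algorithm of the paper. It assumes an unweighted, Kolmogorov-Smirnov-style running sum that adds n - m for each category gene and subtracts m for each other gene (n genes in the list, m in the category), and it takes the maximum absolute deviation as the test statistic. The dynamic program counts the orderings whose running sum stays strictly below the observed deviation; the remaining fraction of all C(n, m) orderings is the p-value. The function names are hypothetical.

```python
from math import comb


def running_sum_max(hits):
    """Maximum absolute deviation of the unweighted running sum.

    hits: list of bools in rank order, True where the gene is in the category.
    """
    n, m = len(hits), sum(hits)
    rs, best = 0, 0
    for h in hits:
        rs += (n - m) if h else -m
        best = max(best, abs(rs))
    return best


def exact_p_value(hits):
    """Exact probability that a random ordering deviates at least as much as observed."""
    n, m = len(hits), sum(hits)
    observed = running_sum_max(hits)
    # ways[k]: number of prefixes of the current length with k category genes
    # whose running sum never reached the observed deviation.
    ways = {0: 1}
    for i in range(1, n + 1):
        nxt = {}
        for k, count in ways.items():
            for k2 in (k, k + 1):  # next gene is a miss (k) or a hit (k + 1)
                misses = i - k2
                if k2 > m or misses > n - m:
                    continue
                rs = k2 * (n - m) - misses * m
                if abs(rs) < observed:
                    nxt[k2] = nxt.get(k2, 0) + count
        ways = nxt
    inside = ways.get(m, 0)
    return 1 - inside / comb(n, m)
```

Because the running sum depends only on the prefix length and the number of category genes seen so far, the table has at most m + 1 entries per step. This keeps the computation deterministic, unlike a permutation test.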