Massive text mining of the biological literature holds great promise of relating disparate information and discovering new knowledge. However, disambiguation of gene symbols is a major bottleneck.
We developed a simple thesaurus-based disambiguation algorithm that can operate with very little training data. The thesaurus comprises the information from five human genetic databases and MeSH. The extent of the homonym problem for human gene symbols is shown to be substantial (33% of the genes in our combined thesaurus had one or more ambiguous symbols), not only because one symbol can refer to multiple genes, but also because a gene symbol can have many non-gene meanings. A test set of 52,529 Medline abstracts, containing 690 ambiguous human gene symbols taken from OMIM, was automatically generated. Overall accuracy of the disambiguation algorithm was up to 92.7% on the test set.
The ambiguity of human gene symbols is substantial, not only because one symbol may denote multiple genes but particularly because many symbols have other, non-gene meanings. The proposed disambiguation approach resolves most ambiguities in our test set with high accuracy, including the important gene/not a gene decisions. The algorithm is fast and scalable, enabling gene-symbol disambiguation in massive text mining applications.
Biological signals may carry specific characteristics that reflect basic dynamics of the body. In particular, heart beat signals carry specific signatures that are related to human physiologic mechanisms. In recent years, many researchers have shown that representations which used non-linear symbolic sequences can often reveal much hidden dynamic information. This kind of symbolization proved to be useful for predicting life-threatening cardiac diseases.
This paper presents an improved method called the “Adaptive Interbeat Interval Analysis (AIIA) method”. The AIIA method uses the Simple K-Means algorithm for symbolization, which offers a new way to represent subtle variations between two interbeat intervals without human intervention. After symbolization, it uses the n-gram algorithm to generate different kinds of symbolic sequences. Each symbolic sequence stands for a variation phase. Finally, the symbolic sequences are categorized by classic classifiers.
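The symbolization and n-gram steps described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that the "variation between two interbeat intervals" is the difference between successive intervals, uses a plain one-dimensional k-means with a deterministic start, and all function names and the toy interval values are hypothetical.

```python
from collections import Counter

def kmeans_1d(values, k, iterations=50):
    """Plain 1-D k-means; centroids start at evenly spaced order statistics."""
    ordered = sorted(values)
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centroids[c]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous centroid.
        updated = [sum(c) / len(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return centroids

def symbolize(values, centroids):
    """Replace each value by the index of its nearest centroid (its symbol)."""
    return [min(range(len(centroids)), key=lambda c: abs(v - centroids[c]))
            for v in values]

def ngram_counts(symbols, n):
    """Count every run of n consecutive symbols."""
    return Counter(tuple(symbols[i:i + n]) for i in range(len(symbols) - n + 1))

# Toy interbeat intervals in seconds (made-up values, for illustration only).
rr = [0.80, 0.82, 0.79, 1.10, 0.60, 0.81, 0.80, 0.83, 1.05, 0.62]
# Assumption: the quantity being clustered is the change between successive intervals.
diffs = [b - a for a, b in zip(rr, rr[1:])]
centroids = kmeans_1d(diffs, k=3)
symbols = symbolize(diffs, centroids)
features = ngram_counts(symbols, n=3)
```

The resulting n-gram counts would then serve as the feature vector passed to a classifier; the abstract reports its results with 3-grams and 26 clusters rather than the 3 clusters used in this toy example.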
In the experiments presented in this paper, the AIIA method achieved 91% accuracy (3-gram, 26 clusters) in classifying patients with atrial fibrillation (AF), patients with congestive heart failure (CHF), and healthy people. It also achieved 87% accuracy (3-gram, 26 clusters) in classifying patients with apnea.
The two experiments presented in this paper demonstrate that the AIIA method can categorize different heart diseases. Both experiments obtained their best classification results with the Bayesian Network classifier. For future work, the concept of the AIIA method can be extended to the categorization of other physiological signals, and more features can be added to improve the accuracy.
While modern hardware can provide vast amounts of inexpensive storage for biological databases, the compression of nucleotide sequence data is still of paramount importance in order to facilitate fast search and retrieval operations through a reduction in disk traffic. This issue becomes even more important in light of the recent increase of very large data sets, such as metagenomes. In this article, I propose the Differential Direct Coding algorithm, a general-purpose nucleotide compression protocol that can differentiate between sequence data and auxiliary data by supporting the inclusion of supplementary symbols that are not members of the set of expected nucleotide bases, thereby offering reconciliation between sequence-specific and general-purpose compression strategies. This algorithm permits a sequence to contain a rich lexicon of auxiliary symbols that can represent wildcards, annotation data and special subsequences, such as functional domains or special repeats. In particular, the representation of special subsequences can be incorporated to provide structure-based coding that increases the overall degree of compression. Moreover, supporting a robust set of symbols removes the requirement of wildcard elimination and restoration phases, resulting in a complexity of O(n) for execution time, making this algorithm suitable for very large data sets. Because this algorithm compresses data on the basis of triplets, it is highly amenable to interpretation as a polypeptide at decompression time. Also, an encoded sequence may be further compressed using other existing algorithms, like gzip, thereby maximizing the final degree of compression. Overall, the Differential Direct Coding algorithm can offer a beneficial impact on disk traffic for database queries and other disk-intensive operations.
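To make the idea of triplet-based coding with auxiliary symbols concrete, the sketch below packs each pure A/C/G/T triplet into one byte and passes any other symbol through behind an escape byte. This is only an illustration under stated assumptions: the byte layout, the `ESCAPE` value and the function names are hypothetical and are not the published Differential Direct Coding format. Input is assumed to be ASCII text.

```python
BASES = "ACGT"
ESCAPE = 255  # hypothetical marker: the next byte is a literal (auxiliary) symbol

def encode(seq):
    """Encode a sequence in a single left-to-right pass, so time is O(n)."""
    out = bytearray()
    i = 0
    while i < len(seq):
        triplet = seq[i:i + 3]
        if len(triplet) == 3 and all(b in BASES for b in triplet):
            code = 0
            for b in triplet:
                code = code * 4 + BASES.index(b)  # base-4 number in 0..63
            out.append(code)
            i += 3
        else:
            # Wildcards, annotations and short tails are kept verbatim.
            out.append(ESCAPE)
            out.append(ord(seq[i]))
            i += 1
    return bytes(out)

def decode(data):
    """Invert encode(); no separate wildcard restoration phase is needed."""
    out = []
    i = 0
    while i < len(data):
        if data[i] == ESCAPE:
            out.append(chr(data[i + 1]))
            i += 2
        else:
            code = data[i]
            out.append(BASES[code // 16] + BASES[(code // 4) % 4] + BASES[code % 4])
            i += 1
    return "".join(out)

sample = "ACGTNNACGTTTA"
packed = encode(sample)
restored = decode(packed)
```

Because each one-byte code corresponds to exactly one triplet, a decoder could in principle map codes directly to amino acids, which is the property the abstract refers to as interpretation as a polypeptide at decompression time.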
Arabidopsis thaliana is the most widely studied plant today. The concerted efforts of over 11 000 researchers and 4000 organizations around the world are generating a rich diversity and quantity of information and materials. This information is made available through a comprehensive on-line resource called the Arabidopsis Information Resource (TAIR) (http://arabidopsis.org), which is accessible via commonly used web browsers and can be searched and downloaded in a number of ways. In the last two years, efforts have been focused on increasing data content and diversity, functionally annotating genes and gene products with controlled vocabularies, and improving data retrieval, analysis and visualization tools. New information includes sequence polymorphisms including alleles, germplasms and phenotypes, Gene Ontology annotations, gene families, protein information, metabolic pathways, gene expression data from microarray experiments and seed and DNA stocks. New data visualization and analysis tools include SeqViewer, which interactively displays the genome from the whole chromosome down to 10 kb of nucleotide sequence, and AraCyc, a metabolic pathway database and map tool that allows overlaying expression data onto the pathway diagrams. Finally, we have recently incorporated seed and DNA stock information from the Arabidopsis Biological Resource Center (ABRC) and implemented a shopping-cart style on-line ordering system.
Knowledge representation involves enumeration of conceptual symbols and arrangement of these symbols into some meaningful structure. Medical knowledge representation has traditionally focused more on the structure than the symbols. Several significant efforts are under way, at local, national, and international levels, to address the representation of the symbols through the creation of high-quality terminologies that are themselves knowledge based. This paper reviews these efforts, including the Medical Entities Dictionary (MED) in use at Columbia University and the New York Presbyterian Hospital. A decade's experience with the MED is summarized to serve as a proof-of-concept that knowledge-based terminologies can support the use of coded patient data for a variety of knowledge-based activities, including the improved understanding of patient data, the access of information sources relevant to specific patient care problems, the application of expert systems directly to the care of patients, and the discovery of new medical knowledge. The terminological knowledge in the MED has also been used successfully to support clinical application development and maintenance, including that of the MED itself. On the basis of this experience, current efforts to create standard knowledge-based terminologies appear to be justified.
Information on more than 35 000 full-length Oryza sativa cDNAs, together with associated microarray gene expression data collected under various treatment conditions, has made it feasible to identify motifs that are conserved in gene promoters and may act as cis-regulatory elements with key roles under the various conditions.
We have developed a novel tool that searches for cis-element candidates in the upstream, downstream, or coding regions of differentially regulated genes. The tool first lists cis-element candidates by motif searching, based on the supposition that if cis-elements play important roles in the regulation of a given set of genes, they will be statistically overrepresented and conserved. It then evaluates the likelihood scores of the listed candidate motifs by association rule analysis. This strategy depends on the idea that motifs overrepresented in the promoter region could play specific roles in the regulation of expression of these genes. The tool is designed so that any biological researcher can use it easily at a publicly accessible Internet site. We evaluated the accuracy and utility of the tool by using a dataset of auxin-inducible genes that have well-studied cis-elements. The test showed the effectiveness of the tool in identifying significant relationships between cis-element candidates and related sets of genes.
The tool lists possible cis-element motifs corresponding to genes of interest, and it will contribute to the deeper understanding of gene regulatory mechanisms in plants.
In the last decade, a large amount of microarray gene expression data has been accumulated in public repositories. Integrating and analyzing high-throughput gene expression data have become key activities for exploring gene functions, gene networks and biological pathways. Effectively utilizing these invaluable microarray data remains challenging due to a lack of powerful tools to integrate large-scale gene-expression information across diverse experiments and to search and visualize a large number of gene-expression data points.
Gene Expression Browser is a microarray data integration, management and processing system with web-based search and visualization functions. An innovative method has been developed to define a treatment over a control for every microarray experiment to standardize and make microarray data from different experiments homogeneous. In the browser, data are pre-processed offline and the resulting data points are visualized online with a 2-layer dynamic web display. Users can view all treatments over control that affect the expression of a selected gene via Gene View, and view all genes that change in a selected treatment over control via treatment over control View. Users can also check the changes of expression profiles of a set of either the treatments over control or genes via Slide View. In addition, the relationships between genes and treatments over control are computed according to gene expression ratio and are shown as co-responsive genes and co-regulation treatments over control.
Gene Expression Browser is composed of a set of software tools, including a data extraction tool, a microarray data-management system, a data-annotation tool, a microarray data-processing pipeline, and a data search & visualization tool. The browser is deployed as a free public web service (http://www.ExpressionBrowser.com) that integrates 301 ATH1 gene microarray experiments from public data repositories (viz. the Gene Expression Omnibus repository at the National Center for Biotechnology Information and Nottingham Arabidopsis Stock Center). The set of Gene Expression Browser software tools can be easily applied to the large-scale expression data generated by other platforms and in other species.
DNA microarrays are a powerful tool for monitoring the expression of tens of thousands of genes simultaneously. With the advance of microarray technology, the challenge becomes how to analyze a large amount of microarray data and make biological sense of them. Affymetrix GeneChips are widely used microarrays, for which a variety of statistical algorithms have been explored and used for detecting significant genes in the experiment. These methods rely solely on the quantitative data, i.e., signal intensity; however, qualitative data are also important parameters in detecting differentially expressed genes.
AffyMiner is a tool developed for detecting differentially expressed genes in Affymetrix GeneChip microarray data and for associating gene annotation and gene ontology information with the genes detected. AffyMiner consists of the functional modules, GeneFinder for detecting significant genes in a treatment versus control experiment and GOTree for mapping genes of interest onto the Gene Ontology (GO) space; and interfaces to run Cluster, a program for clustering analysis, and GenMAPP, a program for pathway analysis. AffyMiner has been used for analyzing the GeneChip data and the results were presented in several publications.
AffyMiner fills an important gap in finding differentially expressed genes in Affymetrix GeneChip microarray data. AffyMiner effectively deals with multiple replicates in the experiment and takes into account both quantitative and qualitative data in identifying significant genes. AffyMiner reduces the time and effort needed to compare data from multiple arrays and to interpret the possible biological implications associated with significant changes in a gene's expression.
Genomic microarrays are powerful research tools in bioinformatics and modern medicinal research because they enable massively parallel assays and simultaneous monitoring of the expression of thousands of genes in biological samples. However, even a simple microarray experiment often yields very high-dimensional data and a huge amount of information, which challenges researchers to extract the important features and reduce the dimensionality. In this paper, a nonlinear dimensionality reduction method, kernel-based locally linear embedding (LLE), is proposed, in which a fuzzy K-nearest neighbors algorithm that denoises the datasets replaces the K-nearest neighbors (KNN) step of classical LLE. In addition, a kernel-based support vector machine (SVM) is used to classify the genomic microarray data sets. We demonstrate the application of the techniques to two published DNA microarray data sets. The experimental results confirm the superiority and high success rates of the presented method.
Manifold learning; Dimensionality reduction; Locally linear embedding; Kernel methods; Support vector machine
The HUGO Gene Nomenclature Committee (HGNC) aims to assign a unique gene symbol and name to every human gene. The HGNC database currently contains almost 30 000 approved gene symbols, over 19 000 of which represent protein-coding genes. The public website, www.genenames.org, displays all approved nomenclature within Symbol Reports that contain data curated by HGNC editors and links to related genomic, phenotypic and proteomic information. Here we describe improvements to our resources, including a new Quick Gene Search, a new List Search, an integrated HGNC BioMart and a new Statistics and Downloads facility.
Microarray experiments produce expression measurements on a genomic scale. One way to derive a functional understanding of the data is to focus on functional sets of genes, such as pathways, instead of individual genes. While a common practice for pathway-level analysis has been functional enrichment analysis, such as over-representation analysis and gene set enrichment analysis, an alternative approach has also been explored. In this approach, gene expression data are first aggregated at the pathway level to transform the original data into a compact representation in which each row corresponds to a pathway instead of a gene. Thereafter the pathway expression data can be used for differential expression and classification analyses in pathway space, leveraging existing algorithms usually applied to gene expression data. While several studies have proposed pathway-level aggregation methods, it remains unclear how they compare with one another, since the evaluations were done only to a limited extent. This study therefore presents a comprehensive evaluation of the six most prominent aggregation methods.
The compared methods include five existing methods--mean of all member genes (Mean all), mean of condition-responsive genes (Mean CORGs), analysis of sample set enrichment scores (ASSESS), principal component analysis (PCA), and partial least squares (PLS)--and a variant of an existing method (Mean top 50%, averaging top half of member genes). Comprehensive and stringent benchmarking was performed by collecting seven pairs of related but independent datasets encompassing various phenotypes. Aggregation was done in the space of KEGG pathways. Performance of the methods was assessed by classification accuracy validated both internally and externally, and by examining the correlative extent of pathway signatures between the dataset pairs. The assessment revealed that (i) the best accuracy and correlation were obtained from ASSESS and Mean top 50%, (ii) Mean all showed the lowest accuracy, and (iii) Mean CORGs and PLS gave rise to the largest extent of discordance in the pathway signature correlation.
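Two of the simpler aggregation schemes named above can be sketched in a few lines. The sketch is illustrative only: the abstract does not say how the "top half" of member genes is chosen, so the ranking by absolute difference between class means is an assumption, and the function names and toy data are hypothetical.

```python
def mean_all(expr, members):
    """Average all member genes per sample.

    expr maps gene -> list of per-sample expression values.
    Returns one pathway-level value per sample.
    """
    n_samples = len(expr[members[0]])
    return [sum(expr[g][s] for g in members) / len(members)
            for s in range(n_samples)]

def mean_top_half(expr, members, labels):
    """Average only the half of the member genes that best separates two classes.

    Assumption: genes are ranked by the absolute difference of class means.
    """
    def separation(gene):
        a = [v for v, lab in zip(expr[gene], labels) if lab == 0]
        b = [v for v, lab in zip(expr[gene], labels) if lab == 1]
        return abs(sum(a) / len(a) - sum(b) / len(b))

    ranked = sorted(members, key=separation, reverse=True)
    top = ranked[:max(1, len(ranked) // 2)]
    return mean_all(expr, top)

# Toy pathway with four genes and four samples (two per class).
expr = {
    "g1": [1.0, 1.0, 3.0, 3.0],
    "g2": [2.0, 2.0, 2.0, 2.0],
    "g3": [0.0, 0.0, 4.0, 4.0],
    "g4": [5.0, 5.0, 5.0, 6.0],
}
labels = [0, 0, 1, 1]
pathway = ["g1", "g2", "g3", "g4"]
all_scores = mean_all(expr, pathway)
top_scores = mean_top_half(expr, pathway, labels)
```

In a classification setting the gene ranking would have to be derived from training samples only; otherwise the pathway scores leak label information into the evaluation.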
The two best performing methods (ASSESS and Mean top 50%) are suggested to be preferred. The benchmarking analysis also suggests that there is both room and necessity for developing a novel method for pathway-level aggregation.
Computational analysis of microarray data has provided an effective way to identify disease-related genes. Traditional methods for selecting disease genes from microarray data, such as statistical tests, focus on genes that are differentially expressed between samples and prioritize each gene individually. These methods might miss differentially coexpressed (DCE) gene subsets because they ignore the interactions between genes. In this paper, the MIClique algorithm is proposed to identify DCE gene subsets based on mutual information and clique analysis. Mutual information is used to measure the coexpression relationship between each pair of genes in two different kinds of samples. Clique analysis is commonly used in biological networks, where a clique generally represents a biological module of similar function. By applying the MIClique algorithm to real gene expression data, some DCE gene subsets that are correlated under one experimental condition but uncorrelated under another are detected from the graphs of a colon dataset and a leukemia dataset.
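The mutual-information step can be illustrated with a small histogram-based estimator. This is a sketch under stated assumptions, not the MIClique implementation: equal-width binning, the number of bins, and the idea of scoring a gene pair by the difference in mutual information between two conditions are choices made here for illustration, and all names are hypothetical.

```python
from collections import Counter
from math import log2

def discretize(values, bins=3):
    """Map continuous values to equal-width bin indices 0..bins-1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / bins
    return [min(int((v - lo) / width), bins - 1) for v in values]

def mutual_information(x, y, bins=3):
    """Estimate I(X;Y) in bits from paired expression profiles."""
    dx, dy = discretize(x, bins), discretize(y, bins)
    n = len(dx)
    px, py, pxy = Counter(dx), Counter(dy), Counter(zip(dx, dy))
    return sum((c / n) * log2(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())

def coexpression_change(pair_in_a, pair_in_b, bins=3):
    """Difference in mutual information for one gene pair across two conditions."""
    return mutual_information(*pair_in_a, bins=bins) - mutual_information(*pair_in_b, bins=bins)

# Gene pair that moves together in condition A but independently in condition B.
cond_a = ([1, 2, 3, 4, 5, 6], [1, 2, 3, 4, 5, 6])
cond_b = ([0, 0, 1, 1], [0, 1, 0, 1])
mi_a = mutual_information(*cond_a)
mi_b = mutual_information(*cond_b, bins=2)
```

Gene pairs with a large change could then be connected as edges of a graph in which cliques are searched; that graph step is omitted here.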
An important challenge in data mining is in identifying “similar” temporal patterns that may illuminate hidden information in a database of time series. We are actively engaged in the development of a temporal database of several thousand ICU patient records that contains time-varying physiologic measurements recorded over each patient’s ICU stay. The discovery of multiparameter temporal patterns that are predictive of physiologic instability may aid clinicians in optimizing care for critically-ill patients.
In this paper, we introduce a novel temporal similarity metric based on a transformation of time series data into an intuitive symbolic representation. The symbolic transform is based on a wavelet decomposition to characterize time series dynamics at multiple time scales. The symbolic transformation allows us to utilize classical information retrieval algorithms based on a vector-space model. Our algorithm is capable of assessing the similarity between multi-dimensional time series and is computationally efficient.
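One way to picture a wavelet-based symbolic transform combined with a vector-space comparison is sketched below. It is an illustration under assumptions rather than the metric described above: the Haar wavelet, the three-symbol alphabet ("up", "down", "flat"), the threshold and the cosine measure are all choices made here, and the function names are hypothetical.

```python
from collections import Counter
from math import sqrt

def haar_details(series):
    """Haar detail coefficients per scale; length must be a power of two."""
    levels = []
    current = list(series)
    while len(current) > 1:
        pairs = list(zip(current[0::2], current[1::2]))
        levels.append([(a - b) / 2 for a, b in pairs])
        current = [(a + b) / 2 for a, b in pairs]
    return levels

def to_terms(series, threshold=0.5):
    """Turn each coefficient into a word such as 'L1:up' and count the words."""
    words = []
    for level, details in enumerate(haar_details(series)):
        for d in details:
            # A negative detail means the second half of the window is higher.
            trend = "up" if d < -threshold else "down" if d > threshold else "flat"
            words.append(f"L{level}:{trend}")
    return Counter(words)

def cosine(u, v):
    """Cosine similarity between two term-count vectors."""
    dot = sum(u[t] * v[t] for t in u)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

rising = [1, 2, 3, 4, 5, 6, 7, 8]
falling = list(reversed(rising))
same = cosine(to_terms(rising), to_terms(rising))
opposite = cosine(to_terms(rising), to_terms(falling))
```

Multi-dimensional series could be handled by prefixing each word with the signal name so that all signals of a patient share one term vector.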
We utilized our algorithm to identify similar physiologic patterns in hemodynamic time series from ICU patients. The similarities between different patient time series may have meaningful physiologic interpretations in the detection of impending hemodynamic deterioration, and may be of potential use in clinical decision-support systems. As a generalized time series similarity metric, the algorithms that are described have applications in several other domains as well.
Genew, the Human Gene Nomenclature Database, is the only resource that provides data for all human genes which have approved symbols. It is managed by the HUGO Gene Nomenclature Committee (HGNC) as a confidential database, containing over 16 000 records, 80% of which are represented on the Web by searchable text files. The data in Genew are highly curated by HGNC editors and gene records can be searched on the Web by symbol or name to directly retrieve information on gene symbol, gene name, cytogenetic location, OMIM number and PubMed ID. Data are integrated with other human gene databases, e.g. GDB, LocusLink and SWISS-PROT, and approved gene symbols are carefully co-ordinated with the Mouse Genome Database (MGD). Approved gene symbols are available for querying and browsing at http://www.gene.ucl.ac.uk/cgi-bin/nomenclature/searchgenes.pl.
Because of the large volume of data and the intrinsic variation of data intensity observed in microarray experiments, different statistical methods have been used to systematically extract biological information and to quantify the associated uncertainty. The simplest method to identify differentially expressed genes is to evaluate the ratio of average intensities in two different conditions and consider all genes that differ by more than an arbitrary cut-off value to be differentially expressed. This filtering approach is not a statistical test and there is no associated value that can indicate the level of confidence in the designation of genes as differentially expressed or not differentially expressed. At the same time, the fold change by itself provides valuable information, and it is important to find unambiguous ways of using this information in expression data treatment.
A new method of finding differentially expressed genes, called the distributional fold change (DFC) test, is introduced. The method is based on an analysis of the intensity distribution of all microarray probe sets mapped to a three-dimensional feature space composed of average expression level, average difference of gene expression and total variance. The proposed method allows one to rank each feature based on the signal-to-noise ratio and to ascertain for each feature the confidence level and power for being differentially expressed. The performance of the new method was evaluated using the total and partial area under receiver operating characteristic curves, tested on 11 data sets from the Gene Expression Omnibus database with independently verified differentially expressed genes, and compared with the t-test and shrinkage t-test. Overall the DFC test performed the best: on average it had higher sensitivity and partial AUC, and this advantage was most prominent in the low range of differentially expressed features, typical for formalin-fixed paraffin-embedded sample sets.
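The simple ratio filter described above can be written down directly. The sketch assumes strictly positive intensities on the original (not log) scale; the function name, the cut-off of 2 and the toy values are hypothetical.

```python
def fold_change_filter(cond_a, cond_b, cutoff=2.0):
    """Keep genes whose ratio of mean intensities exceeds the cut-off in either direction.

    cond_a and cond_b map gene -> list of replicate intensities (all > 0).
    Returns gene -> ratio for the genes that pass.
    """
    selected = {}
    for gene in cond_a:
        mean_a = sum(cond_a[gene]) / len(cond_a[gene])
        mean_b = sum(cond_b[gene]) / len(cond_b[gene])
        ratio = mean_a / mean_b
        if ratio >= cutoff or ratio <= 1 / cutoff:
            selected[gene] = ratio
    return selected

treated = {"geneA": [100.0, 120.0], "geneB": [100.0, 100.0], "geneC": [10.0, 10.0]}
control = {"geneA": [50.0, 60.0], "geneB": [90.0, 110.0], "geneC": [40.0, 40.0]}
hits = fold_change_filter(treated, control)
```

Note that the function returns no p-value or confidence level, and replicate variability plays no role in the decision, which is exactly the limitation the paragraph points out.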
The distributional fold change test is an effective method for finding and ranking differentially expressed probesets on microarrays. The application of this test is advantageous to data sets using formalin-fixed paraffin-embedded samples or other systems where degradation effects diminish the applicability of correlation adjusted methods to the whole feature set.
Differential expression; Microarray; Feature selection; Fold change; Statistical test; ROC curve; FFPE
Several tools have been developed to perform global gene expression profile data analysis, to search for specific chromosomal regions whose features meet defined criteria as well as to study neighbouring gene expression. However, most of these tools are tailored for a specific use in a particular context (e.g. they are species-specific, or limited to a particular data format) and they typically accept only gene lists as input.
TRAM (Transcriptome Mapper) is a new general tool that allows the simple generation and analysis of quantitative transcriptome maps, starting from any source listing gene expression values for a given gene set (e.g. expression microarrays), implemented as a relational database. It includes a parser able to assign univocal and updated gene symbols to gene identifiers from different data sources. Moreover, TRAM is able to perform intra-sample and inter-sample data normalization, including an original variant of quantile normalization (scaled quantile), useful to normalize data from platforms with highly different numbers of investigated genes. When in 'Map' mode, the software generates a quantitative representation of the transcriptome of a sample (or of a pool of samples) and identifies if segments of defined lengths are over/under-expressed compared to the desired threshold. When in 'Cluster' mode, the software searches for a set of over/under-expressed consecutive genes. Statistical significance for all results is calculated with respect to genes localized on the same chromosome or to all genome genes. Transcriptome maps, showing differential expression between two sample groups, relative to two different biological conditions, may be easily generated. We present the results of a biological model test, based on a meta-analysis comparison between a sample pool of human CD34+ hematopoietic progenitor cells and a sample pool of megakaryocytic cells. Biologically relevant chromosomal segments and gene clusters with differential expression during the differentiation toward megakaryocyte were identified.
TRAM is designed to create, and statistically analyze, quantitative transcriptome maps, based on gene expression data from multiple sources. The release includes FileMaker Pro database management runtime application and it is freely available at http://apollo11.isto.unibo.it/software/, along with preconfigured implementations for mapping of human, mouse and zebrafish transcriptomes.
The main focus in pin-tip (or print-tip) microarray analysis is determining which probes, genes, or oligonucleotides are differentially expressed. Specifically in array comparative genomic hybridization (aCGH) experiments, researchers search for chromosomal imbalances in the genome. To model this data, scientists apply statistical methods to the structure of the experiment and assume that the data consist of the signal plus random noise. In this paper we propose “SmoothArray”, a new method to preprocess comparative genomic hybridization (CGH) bacterial artificial chromosome (BAC) arrays and we show the effects on a cancer dataset. As part of our R software package “aCGHplus,” this freely available algorithm removes the variation due to the intensity effects, pin/print-tip, the spatial location on the microarray chip, and the relative location from the well plate. Removal of this variation improves the downstream analysis and subsequent inferences made on the data. Further, we present measures to evaluate the quality of the dataset according to the arrayer pins, 384-well plates, plate rows, and plate columns. We compare our method against competing methods using several metrics to measure the biological signal. With this novel normalization algorithm and quality control measures, the user can improve their inferences on datasets and pinpoint problems that may arise in their BAC aCGH technology.
In prokaryotic genomes, genes are organized in operons, and the genes within an operon tend to have similar levels of expression. Because of co-transcription of genes within an operon, borrowing information from other genes within the same operon can improve the estimation of relative transcript levels; the estimation of relative levels of transcript abundances is one of the most challenging tasks in experimental genomics due to the high noise level in microarray data. Therefore, techniques that can improve such estimations, and moreover are based on sound biological premises, are expected to benefit the field of microarray data analysis.
In this paper, we propose a hierarchical Bayesian model, which relies on borrowing information from other genes within the same operon, to improve the estimation of gene expression levels and, hence, the detection of differentially expressed genes. The simulation studies and the analysis of experimental data demonstrated that the proposed method outperformed other techniques that are routinely used to estimate transcript levels and detect differentially expressed genes, including the sample mean and SAM t statistics. The improvement became more significant as the noise level in microarray data increased.
By borrowing information about transcriptional activity of genes within classified operons, we improved the estimation of gene expression levels and the detection of differentially expressed genes.
It is common for the results of a microarray study to be analyzed in the context of biologically-motivated groups of genes such as pathways or Gene Ontology categories. The most common method for such analysis uses the hypergeometric distribution (or a related technique) to look for "over-representation" of groups among genes selected as being differentially expressed or otherwise of interest based on a gene-by-gene analysis. However, this method suffers from some limitations, and biologist-friendly tools that implement alternatives have not been reported.
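The hypergeometric over-representation calculation mentioned above amounts to an upper-tail probability, which can be computed exactly with integer binomial coefficients. The function and parameter names below are hypothetical, and the numbers in the example are made up.

```python
from math import comb

def overrepresentation_p(total_genes, set_size, selected, overlap):
    """P(X >= overlap) for a hypergeometric draw.

    total_genes: genes on the array
    set_size:    genes annotated to the group (e.g. one GO category)
    selected:    genes called differentially expressed
    overlap:     selected genes that belong to the group
    """
    upper = min(set_size, selected)
    tail = sum(comb(set_size, k) * comb(total_genes - set_size, selected - k)
               for k in range(overlap, upper + 1))
    return tail / comb(total_genes, selected)

# 10 genes in total, 5 in the group, 5 selected, all 5 selected genes in the group.
p_extreme = overrepresentation_p(10, 5, 5, 5)
# Requiring an overlap of at least 0 is always satisfied.
p_trivial = overrepresentation_p(10, 5, 5, 0)
```

The result depends on where the cut-off for "selected" genes is drawn, which is one of the limitations that motivates score-based and resampling-based alternatives.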
We introduce ErmineJ, a multiplatform user-friendly stand-alone software tool for the analysis of functionally-relevant sets of genes in the context of microarray gene expression data. ErmineJ implements multiple algorithms for gene set analysis, including over-representation and resampling-based methods that focus on gene scores or correlation of gene expression profiles. In addition to a graphical user interface, ErmineJ has a command line interface and an application programming interface that can be used to automate analyses. The graphical user interface includes tools for creating and modifying gene sets, visualizing the Gene Ontology as a table or tree, and visualizing gene expression data. ErmineJ comes with a complete user manual, and is open-source software licensed under the Gnu Public License.
The availability of multiple analysis algorithms, together with a rich feature set and simple graphical interface, should make ErmineJ a useful addition to the biologist's informatics toolbox. ErmineJ is available from .
Motivation: The identification of condition specific sub-networks from gene expression profiles has important biological applications, ranging from the selection of disease-related biomarkers to the discovery of pathway alterations across different phenotypes. Although many methods exist for extracting these sub-networks, very few existing approaches simultaneously consider both the differential expression of individual genes and the differential correlation of gene pairs, losing potentially valuable information in the data.
Results: In this article, we propose a new method, COSINE (COndition SpecIfic sub-NEtwork), which employs a scoring function that jointly measures the condition-specific changes of both ‘nodes’ (individual genes) and ‘edges’ (gene–gene co-expression). It uses a genetic algorithm to search for the single optimal sub-network that maximizes the scoring function. We applied COSINE to simulated datasets with various differential expression patterns and to three real datasets: a prostate cancer dataset, a dataset from the across-tissue comparison of morbidly obese patients, and a dataset from the across-population comparison of the HapMap samples. Compared with previous methods, COSINE is more powerful in identifying truly significant sub-networks of appropriate size and meaningful biological relevance.
Availability: The R code is available as the COSINE package on CRAN: http://cran.r-project.org/web/packages/COSINE/index.html.
Supplementary information: Supplementary data are available at Bioinformatics online.
Routine application of gene expression microarray technology is rapidly producing large amounts of data that necessitate new approaches of analysis. The analysis of a specific microarray experiment profits enormously from cross-comparing to other experiments. This process is generally performed by numerical meta-analysis of published data where the researcher chooses the datasets to be analyzed based on assumptions about the biological relations of published datasets to his own data, thus severely limiting the possibility of finding surprising connections. Here we propose using a repository of published gene lists for the identification of interesting datasets to be subjected to more detailed numerical analysis.
We have compiled lists of genes that have been reported as differentially regulated in cancer related microarray studies. We searched these gene lists for statistically significant overlaps with lists of genes regulated by the tumor suppressors p16 and pRB. We identified a highly significant overlap of p16 and pRB target genes with genes regulated by the EWS/FLI fusion protein. Detailed numerical analysis of these data identified two sets of genes with clearly distinct roles in the G1/S and the G2/M phases of the cell cycle, as measured by enrichment of Gene Ontology categories.
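The abstract does not say which test was used to call an overlap between two gene lists statistically significant. A common choice for this kind of question is the one-sided hypergeometric test, sketched below under that assumption. The function name and the example numbers are hypothetical and are not taken from the study.

```python
from math import comb


def overlap_p_value(universe, size_a, size_b, overlap):
    """Probability of seeing at least `overlap` shared genes by chance.

    universe: number of genes that could appear in either list
    size_a, size_b: lengths of the two gene lists
    overlap: number of genes observed in both lists
    Assumes list B is a random draw of size_b genes from the universe.
    """
    max_shared = min(size_a, size_b)
    total = comb(universe, size_b)
    favourable = sum(
        comb(size_a, k) * comb(universe - size_a, size_b - k)
        for k in range(overlap, max_shared + 1)
    )
    return favourable / total


# Hypothetical numbers: two lists of 200 and 150 genes sharing 30 genes,
# drawn from an array with 10,000 genes.
p = overlap_p_value(10_000, 200, 150, 30)
```

Because the calculation needs only list sizes and the size of the intersection, it can be run on published gene lists without access to the underlying expression values, which is the setting described above.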
We show that mining of published gene lists in the absence of numerical detail about gene expression levels constitutes a fast, easy to perform, widely applicable, and unbiased route towards the identification of biologically related gene expression microarray datasets.
Interpretation of simple microarray experiments is usually based on the fold-change of gene expression between a reference and a "treated" sample where the treatment can be of many types from drug exposure to genetic variation. Interpretation of the results usually combines lists of differentially expressed genes with previous knowledge about their biological function. Here we evaluate a method – based on the PageRank algorithm employed by the popular search engine Google – that tries to automate some of this procedure to generate prioritized gene lists by exploiting biological background information.
GeneRank is an intuitive modification of PageRank that maintains many of its mathematical properties. It combines gene expression information with a network structure derived from gene annotations (gene ontologies) or expression profile correlations. Using both simulated and real data we find that the algorithm offers an improved ranking of genes compared to pure expression change rankings.
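A minimal sketch of how a PageRank-style iteration can combine expression changes with a gene network is given below. It assumes that the uniform "teleport" term of PageRank is replaced by each gene's normalized absolute expression change, and that the network is undirected. The damping parameter `d`, the function name, and the toy data are illustrative assumptions rather than details taken from the abstract.

```python
def generank_sketch(adjacency, expression_change, d=0.5, iterations=100):
    """PageRank-like ranking with an expression-based prior.

    adjacency: dict mapping gene -> list of neighbouring genes (undirected,
               so every edge must be listed in both directions)
    expression_change: dict mapping gene -> measured expression change
                       (at least one value must be non-zero)
    d: weight of the network term; d = 0 ranks by expression change alone
    """
    genes = list(adjacency)
    total = sum(abs(expression_change[g]) for g in genes)
    prior = {g: abs(expression_change[g]) / total for g in genes}
    rank = dict(prior)
    for _ in range(iterations):
        new_rank = {}
        for g in genes:
            # Each neighbour shares its current rank equally among its own neighbours.
            incoming = sum(rank[n] / len(adjacency[n]) for n in adjacency[g])
            new_rank[g] = (1 - d) * prior[g] + d * incoming
        rank = new_rank
    return rank


# Toy network A - B - C: gene B shows no expression change itself
# but is connected to two genes that do.
network = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
changes = {"A": 2.0, "B": 0.0, "C": -1.0}
ranks = generank_sketch(network, changes, d=0.5)
```

In this toy case gene B would be ranked last on fold-change alone, yet it receives a non-zero score because both of its neighbours changed. This is the kind of re-prioritization that network information is meant to provide.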
Our modification of the PageRank algorithm provides an alternative method of evaluating microarray experimental results which combines prior knowledge about the underlying network. GeneRank offers an improvement compared to assessing the importance of a gene based on its experimentally observed fold-change alone and may be used as a basis for further analytical developments.
Microarray experiments generate vast amounts of data. The functional context of differentially expressed genes can be assessed by querying the Gene Ontology (GO) database via GoMiner. Directed acyclic graph representations, which are used to depict GO categories enriched with differentially expressed genes, are difficult to interpret and, depending on the particular analysis, may not be well suited for formulating new hypotheses. Additional graphical methods are therefore needed to augment the GO graphical representation.
We present an alternative visualization approach, area-proportional Euler diagrams, showing set relationships with semi-quantitative size information in a single diagram to support biological hypothesis formulation. The cardinalities of sets and intersection sets are represented by area-proportional Euler diagrams and their corresponding graphical (circular or polygonal) intersection areas. Optimally proportional representations are obtained using swarm and evolutionary optimization algorithms.
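For the simplest case of two sets drawn as circles, area-proportional placement can be sketched directly. The example below assumes that circle areas equal set sizes and uses bisection on the centre distance until the lens-shaped intersection matches the size of the shared set. It is only an illustration of the geometric constraint; diagrams with many sets generally have no exact solution, which is why the approach above relies on swarm and evolutionary optimization. Function names are hypothetical.

```python
from math import acos, pi, sqrt


def lens_area(r1, r2, d):
    """Area of the intersection of two circles with radii r1, r2 and centre distance d."""
    if d >= r1 + r2:
        return 0.0  # circles do not overlap
    if d <= abs(r1 - r2):
        return pi * min(r1, r2) ** 2  # smaller circle lies inside the larger one
    # Clamp arguments to guard against floating-point rounding at the boundaries.
    a1 = max(-1.0, min(1.0, (d * d + r1 * r1 - r2 * r2) / (2 * d * r1)))
    a2 = max(-1.0, min(1.0, (d * d + r2 * r2 - r1 * r1) / (2 * d * r2)))
    kite = max(0.0, (-d + r1 + r2) * (d + r1 - r2) * (d - r1 + r2) * (d + r1 + r2))
    return r1 * r1 * acos(a1) + r2 * r2 * acos(a2) - 0.5 * sqrt(kite)


def place_two_sets(size_a, size_b, size_ab, tol=1e-9):
    """Return (r1, r2, d) so that circle areas and overlap match the set sizes.

    Assumes 0 <= size_ab <= min(size_a, size_b).
    """
    r1, r2 = sqrt(size_a / pi), sqrt(size_b / pi)
    lo, hi = abs(r1 - r2), r1 + r2
    # The overlap area shrinks monotonically as the centres move apart.
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if lens_area(r1, r2, mid) > size_ab:
            lo = mid
        else:
            hi = mid
    return r1, r2, (lo + hi) / 2


# Hypothetical categories with 100 and 60 flagged genes, 20 of them shared.
r1, r2, d = place_two_sets(100, 60, 20)
```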
VennMaster's area-proportional Euler diagrams effectively structure and visualize the results of a GO analysis by indicating to what extent flagged genes are shared by different categories. In addition to reducing the complexity of the output, the visualizations facilitate generation of novel hypotheses from the analysis of seemingly unrelated categories that share differentially expressed genes.
The simultaneous analysis of expression data from multiple species reveals a core set of conserved cycling genes that is much larger than previously thought.
Global transcript levels throughout the cell cycle have been characterized using microarrays in several species. Early analysis of these experiments focused on individual species. More recently, a number of studies have concluded that a surprisingly small number of genes conserved in two or more species are periodically transcribed in these species. Combining and comparing data from multiple species is challenging because of noise in expression data, the different synchronization and scoring methods used, and the need to determine an accurate set of homologs.
To solve these problems, we developed and applied a new algorithm to analyze expression data from multiple species simultaneously. Unlike previous studies, we find that more than 20% of cycling genes in budding yeast have cycling homologs in fission yeast and 5% to 7% of cycling genes in each of four species have cycling homologs in all other species. These conserved cycling genes display much stronger cell cycle characteristics in several complementary high throughput datasets.
Essentiality analysis for yeast and human genes confirms these findings. Motif analysis indicates conservation in the corresponding regulatory mechanisms. Gene Ontology analysis and analysis of the genes in the conserved sets shed light on the evolution of specific subfunctions within the cell cycle.
Our results indicate that the conservation in cyclic expression patterns is much greater than was previously thought. These genes are highly enriched for most cell cycle categories, and a large percentage of them are essential, supporting our claim that cross-species analysis can identify the core set of cycling genes.
Identification of differentially expressed genes from microarray datasets is one of the most important analyses in microarray data mining. Popular algorithms such as the t-test rank genes based on a single statistic. The false positive rate of these methods can be reduced by considering other features of differentially expressed genes.
We propose a pattern recognition strategy for identifying differentially expressed genes. Genes are mapped to a two-dimensional feature space composed of the average difference of gene expression and the average expression level. A density-based pruning algorithm (DB pruning) is developed to prune non-differentially expressed genes and retain potential differentially expressed genes, which are usually located in the sparse boundary region. Biases of popular algorithms for identifying differentially expressed genes are visually characterized. Experiments on 17 datasets from the Gene Expression Omnibus (GEO) database with experimentally verified differentially expressed genes showed that DB pruning can significantly improve the prediction accuracy of popular identification algorithms such as the t-test, rank product, and fold change.
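The pruning idea can be illustrated with a naive sketch. It is not the published DB pruning algorithm, whose details are not given here. It assumes that each gene is a point (average expression level, average expression difference), that local density is the number of other genes within a fixed radius, and that genes in dense regions are pruned while genes in sparse regions are kept as candidates. The parameters `radius` and `max_neighbors` and all names are hypothetical.

```python
from math import hypot


def density_prune(points, radius=0.5, max_neighbors=3):
    """Keep genes that lie in sparse regions of the 2D feature space.

    points: dict mapping gene -> (average expression level, average difference)
    radius: neighbourhood radius used to estimate local density
    max_neighbors: genes with this many neighbours or more are pruned
    Uses a simple O(n^2) scan, which is adequate only for small examples.
    """
    kept = []
    for gene, (x, y) in points.items():
        neighbors = sum(
            1
            for other, (ox, oy) in points.items()
            if other != gene and hypot(x - ox, y - oy) <= radius
        )
        if neighbors < max_neighbors:
            kept.append(gene)
    return kept


# Five genes crowd together with almost no expression difference;
# one gene sits apart with a large difference.
features = {
    "g1": (5.0, 0.0),
    "g2": (5.1, 0.1),
    "g3": (4.9, -0.1),
    "g4": (5.0, 0.2),
    "g5": (5.2, 0.0),
    "g6": (6.0, 3.0),
}
candidates = density_prune(features)
```

The surviving candidates would then be passed to a conventional method such as the t-test, rank product, or fold change. In practice the two feature axes would need to be put on comparable scales before distances are computed.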
Density-based pruning of non-differentially expressed genes is an effective method for enhancing statistical-testing-based algorithms for identifying differentially expressed genes. It improves the t-test, rank product, and fold change by 11% to 50% in the number of true differentially expressed genes identified. The source code of DB pruning is freely available on our website http://mleg.cse.sc.edu/degprune