1.  The Genopolis Microarray Database 
BMC Bioinformatics  2007;8(Suppl 1):S21.
Gene expression databases are key resources for microarray data management and analysis and the importance of a proper annotation of their content is well understood.
Public repositories exist, as do microarray database systems that single laboratories can implement. However, there is not yet a tool that easily supports a collaborative environment in which users with different access rights can interact to define common, highly coherent content. The aim of the Genopolis database is to provide a resource that allows different groups performing microarray experiments on a common subject to create and analyse a common, coherent knowledge base. The Genopolis database has been implemented as a dedicated system for the scientific community studying dendritic cell and macrophage functions and host-parasite interactions.
The Genopolis Database system allows the community to build an object-based, MIAME-compliant annotation of their experiments and to store images and raw and processed data from the Affymetrix GeneChip® platform. It supports dynamic definition of controlled vocabularies and provides automated and supervised steps to control the coherence of data and annotations. It allows precise control of the visibility of the database content to different subgroups in the community and facilitates export of its content to public repositories. It provides an interactive user interface for data analysis: users can visualize data matrices based on functional lists and sample characterization, and navigate to other data matrices defined by similarity of expression values as well as by functional characterization of the genes involved. A collaborative environment is also provided for the definition and sharing of functional annotation by users.
The Genopolis Database supports a community in building and analysing a common coherent knowledge base. This fills a gap between a local database and a public repository, where the development of a common coherent annotation is important. In its current implementation, it provides a uniform, coherently annotated dataset on dendritic cell and macrophage differentiation.
PMCID: PMC1885851  PMID: 17430566
2.  Microarray meta-analysis database (M2DB): a uniformly pre-processed, quality controlled, and manually curated human clinical microarray database 
BMC Bioinformatics  2010;11:421.
Over the past decade, gene expression microarray studies have greatly expanded our knowledge of genetic mechanisms of human diseases. Meta-analysis of substantial amounts of accumulated data, by integrating valuable information from multiple studies, is becoming more important in microarray research. However, collecting data of special interest from public microarray repositories often presents major practical problems. Moreover, including low-quality data may significantly reduce meta-analysis efficiency.
M2DB is a manually curated microarray database designed for easy querying based on clinical information and for interactive retrieval of either raw or uniformly pre-processed data, along with a set of quality-control (QC) metrics. The database contains more than 10,000 previously published Affymetrix GeneChip arrays performed on human clinical specimens. M2DB allows online querying according to a flexible combination of five clinical annotations describing disease state and sampling location. These annotations were manually curated using controlled vocabularies, based on information obtained from GEO, ArrayExpress, and published papers. For array-based quality assessment, the online query provides sets of QC metrics generated using three available QC algorithms. Arrays with poor data quality can easily be excluded from the query interface. The query provides values from two algorithms for gene-based filtering, and raw data and three kinds of pre-processed data for downloading.
M2DB utilizes a user-friendly interface for QC parameters, sample clinical annotations, and data formats to help users obtain clinical metadata. This database provides a lower entry threshold and an integrated process of meta-analysis. We hope that this research will promote further evolution of microarray meta-analysis.
PMCID: PMC2928207  PMID: 20698961
3.  The Medicago truncatula gene expression atlas web server 
BMC Bioinformatics  2009;10:441.
Legumes (Leguminosae or Fabaceae) play a major role in agriculture. Transcriptomics studies in the model legume species, Medicago truncatula, are instrumental in helping to formulate hypotheses about the role of legume genes. With the rapid growth of publicly available Affymetrix GeneChip Medicago Genome Array data from a great range of tissues, cell types, growth conditions, and stress treatments, the legume research community desires an effective bioinformatics system to aid efforts to interpret the Medicago genome through functional genomics. We developed the Medicago truncatula Gene Expression Atlas (MtGEA) web server for this purpose.
The Medicago truncatula Gene Expression Atlas (MtGEA) web server is a centralized platform for analyzing the Medicago transcriptome. Currently, the web server hosts gene expression data from 156 Affymetrix GeneChip® Medicago genome arrays in 64 different experiments, covering a broad range of developmental and environmental conditions. The server enables flexible, multifaceted analyses of transcript data and provides a range of additional information about genes, including different types of annotation and links to the genome sequence, which help users formulate hypotheses about gene function. Transcript data can be accessed using Affymetrix probe identification number, DNA sequence, gene name, functional description in natural language, GO and KEGG annotation terms, and InterPro domain number. Transcripts can also be discovered through co-expression or differential expression analysis. Flexible tools to select a subset of experiments and to visualize and compare expression profiles of multiple genes have been implemented. Data can be downloaded, in part or full, in a tabular form compatible with common analytical and visualization software. The web server will be updated on a regular basis to incorporate new gene expression data and genome annotation, and is accessible at:
The MtGEA web server has a well-managed, rich data set and offers data retrieval and analysis tools within the web platform. It has proven to be a powerful resource for plant biologists to effectively and efficiently identify Medicago transcripts of interest from a multitude of aspects, formulate hypotheses about gene function, and interpret the Medicago genome from a systematic point of view.
PMCID: PMC2804685  PMID: 20028527
4.  B2G-FAR, a species-centered GO annotation repository 
Bioinformatics  2011;27(7):919-924.
Motivation: Functional genomics research has expanded enormously in the last decade thanks to the cost reduction in high-throughput technologies and the development of computational tools that generate, standardize and share information on gene and protein function such as the Gene Ontology (GO). Nevertheless, many biologists, especially those working with non-model organisms, still suffer from non-existent or low-coverage functional annotation, or simply struggle to retrieve, summarize and query these data.
Results: The Blast2GO Functional Annotation Repository (B2G-FAR) is a bioinformatics resource envisaged to provide functional information for otherwise uncharacterized sequence data and offers data mining tools to analyze a larger repertoire of species than currently available. This new annotation resource has been created by applying the Blast2GO functional annotation engine in a high-throughput manner to the entire space of publicly available sequences. The resulting repository contains GO term predictions for over 13.2 million non-redundant protein sequences based on BLAST search alignments from the SIMAP database. We generated GO annotation for approximately 150 000 different taxa, making the 2000 species with the highest coverage available through B2G-FAR. A second section within B2G-FAR holds functional annotations for 17 non-model organism Affymetrix GeneChips.
Conclusions: B2G-FAR provides easy access to exhaustive functional annotation for 2000 species offering a good balance between quality and quantity, thereby supporting functional genomics research especially in the case of non-model organisms.
Availability: The annotation resource is available at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3065692  PMID: 21335611
5.  How to decide? Different methods of calculating gene expression from short oligonucleotide array data will give different results 
BMC Bioinformatics  2006;7:137.
Short oligonucleotide arrays for transcript profiling have been available for several years. Generally, raw data from these arrays are analysed with the aid of the Microarray Analysis Suite or GeneChip Operating Software (MAS or GCOS) from Affymetrix. Recently, more methods to analyse the raw data have become available. Ideally all these methods should come up with more or less the same results. We set out to evaluate the different methods and include work on our own data set, in order to test which method gives the most reliable results.
Calculating gene expression with six different algorithms (MAS5, dChip PMMM, dChip PM, RMA, GC-RMA and PDNN) using the same (Arabidopsis) data results in different calculated gene expression levels. Consequently, depending on the method used, different genes will be identified as differentially regulated. Surprisingly, there was only 27 to 36% overlap between the different methods. Furthermore, 47.5% of the genes/probe sets showed good correlation between the mismatch and perfect match intensities.
After comparing six algorithms, RMA gave the most reproducible results and showed the highest correlation coefficients with Real Time RT-PCR data on genes identified as differentially expressed by all methods. However, we were not able to verify, by Real Time RT-PCR, the microarray results for most genes that were solely calculated by RMA. Furthermore, we conclude that subtraction of the mismatch intensity from the perfect match intensity results most likely in a significant underestimation for at least 47.5% of the expression values. Not one algorithm produced significant expression values for genes present in quantities below 1 pmol. If the only purpose of the microarray experiment is to find new candidate genes, and too many genes are found, then mutual exclusion of the genes predicted by contrasting methods can be used to narrow down the list of new candidate genes by 64 to 73%.
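The overlap figures reported above can be reproduced in spirit on any collection of call sets. The following is a minimal Python sketch, assuming that overlap is measured as the number of shared genes relative to the union of two lists (the abstract does not state the definition used); the gene identifiers and the three call sets are hypothetical.

```python
from itertools import combinations


def overlap_percent(a, b):
    """Percent of genes shared by two call sets, relative to their union.

    Assumed definition: the abstract does not say how overlap was computed.
    """
    union = a | b
    if not union:
        return 0.0
    return 100.0 * len(a & b) / len(union)


def consensus(calls):
    """Genes called differentially expressed by every method."""
    sets = list(calls.values())
    return set.intersection(*sets) if sets else set()


# Hypothetical call sets: genes flagged as differentially expressed per method.
calls = {
    "MAS5": {"g1", "g2", "g3", "g4"},
    "RMA": {"g2", "g3", "g5"},
    "PDNN": {"g2", "g3", "g4", "g6"},
}

# Pairwise overlap between every pair of methods.
pairwise = {
    (m1, m2): overlap_percent(calls[m1], calls[m2])
    for m1, m2 in combinations(calls, 2)
}

# Candidate genes supported by all methods.
shared = consensus(calls)
```

Restricting the candidate list to `shared` is one way to narrow down a long list of candidates; whether it matches the "mutual exclusion" procedure of the paper is an assumption.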
PMCID: PMC1431565  PMID: 16539732
6.  MIMAS: an innovative tool for network-based high density oligonucleotide microarray data management and annotation 
BMC Bioinformatics  2006;7:190.
The high-density oligonucleotide microarray (GeneChip) is an important tool for molecular biological research aiming at large-scale detection of single nucleotide polymorphisms in DNA and genome-wide analysis of mRNA concentrations. Local array data management solutions are instrumental for efficient processing of the results and for subsequent uploading of data and annotations to a global certified data repository at the EBI (ArrayExpress) or the NCBI (Gene Expression Omnibus).
To facilitate and accelerate annotation of high-throughput expression profiling experiments, the Microarray Information Management and Annotation System (MIMAS) was developed. The system is fully compliant with the Minimal Information About a Microarray Experiment (MIAME) convention. MIMAS provides life scientists with a highly flexible and focused GeneChip data storage and annotation platform essential for subsequent analysis and interpretation of experimental results with clustering and mining tools. The system software can be downloaded for academic use upon request.
MIMAS implements a novel concept for nation-wide GeneChip data management whereby a network of facilities is centered on one data node directly connected to the European certified public microarray data repository located at the EBI. The solution proposed may serve as a prototype approach to array data management between research institutes organized in a consortium.
PMCID: PMC1459208  PMID: 16597336
7.  AffyMiner: mining differentially expressed genes and biological knowledge in GeneChip microarray data 
BMC Bioinformatics  2006;7(Suppl 4):S26.
DNA microarrays are a powerful tool for monitoring the expression of tens of thousands of genes simultaneously. With the advance of microarray technology, the challenge becomes how to analyze a large amount of microarray data and make biological sense of them. Affymetrix GeneChips are widely used microarrays, for which a variety of statistical algorithms have been explored and used for detecting significant genes in an experiment. These methods rely solely on the quantitative data, i.e., signal intensity; however, qualitative data are also important parameters in detecting differentially expressed genes.
AffyMiner is a tool developed for detecting differentially expressed genes in Affymetrix GeneChip microarray data and for associating gene annotation and gene ontology information with the genes detected. AffyMiner consists of two functional modules, GeneFinder for detecting significant genes in a treatment versus control experiment and GOTree for mapping genes of interest onto the Gene Ontology (GO) space, as well as interfaces to run Cluster, a program for clustering analysis, and GenMAPP, a program for pathway analysis. AffyMiner has been used to analyze GeneChip data, and the results have been presented in several publications.
AffyMiner fills an important gap in finding differentially expressed genes in Affymetrix GeneChip microarray data. AffyMiner effectively deals with multiple replicates in the experiment and takes into account both quantitative and qualitative data in identifying significant genes. AffyMiner reduces the time and effort needed to compare data from multiple arrays and to interpret the possible biological implications associated with significant changes in a gene's expression.
PMCID: PMC1780108  PMID: 17217519
8.  MIMAS 3.0 is a Multiomics Information Management and Annotation System 
BMC Bioinformatics  2009;10:151.
DNA sequence integrity, mRNA concentrations and protein-DNA interactions have been subject to genome-wide analyses based on microarrays with ever increasing efficiency and reliability over the past fifteen years. However, very recently novel technologies for Ultra High-Throughput DNA Sequencing (UHTS) have been harnessed to study these phenomena with unprecedented precision. As a consequence, the extensive bioinformatics environment available for array data management, analysis, interpretation and publication must be extended to include these novel sequencing data types.
MIMAS was originally conceived as a simple, convenient and local Microarray Information Management and Annotation System focused on GeneChips for expression profiling studies. MIMAS 3.0 enables users to manage data from high-density oligonucleotide SNP Chips, expression arrays (both 3'UTR and tiling) and promoter arrays, BeadArrays as well as UHTS data using MIAME-compliant standardized vocabulary. Importantly, researchers can export data in MAGE-TAB format and upload them to the EBI's ArrayExpress certified data repository using a one-step procedure.
We have vastly extended the capability of the system such that it processes the data output of six types of GeneChips (Affymetrix), two different BeadArrays for mRNA and miRNA (Illumina) and the Genome Analyzer (a popular Ultra-High Throughput DNA Sequencer, Illumina), without compromising on its flexibility and user-friendliness. MIMAS, appropriately renamed into Multiomics Information Management and Annotation System, is currently used by scientists working in approximately 50 academic laboratories and genomics platforms in Switzerland and France. MIMAS 3.0 is freely available via .
PMCID: PMC2694794  PMID: 19450266
9.  easyExon – A Java-based GUI tool for processing and visualization of Affymetrix exon array data 
BMC Bioinformatics  2008;9:432.
Alternative RNA splicing greatly increases proteome diversity and thereby contributes to species- or tissue-specific functions. The possibility to study alternative splicing (AS) events on a genomic scale using splicing-sensitive microarrays, including the Affymetrix GeneChip Exon 1.0 ST microarray (exon array), has appeared very recently. However, the application of this new technology is hindered by the lack of free and user-friendly software devoted to these novel platforms.
In this study we present a Java-based freeware tool, easyExon, to process, filter and visualize exon array data with an analysis pipeline. This tool implements the most commonly used probeset summarization methods as well as AS-oriented filtering algorithms, e.g. MIDAS and PAC, for the detection of alternative splicing events. We include a biological filtering function based on GO terms, and provide a module to visualize and interpret the selected exons and transcripts. Furthermore, easyExon can integrate with other related programs, such as the Integrated Genome Browser (IGB) and Affymetrix Power Tools (APT), to make the whole analysis more comprehensive. We applied easyExon to a publicly accessible colon cancer dataset as an example to illustrate the analysis pipeline of this tool.
EasyExon can efficiently process and analyze the Affymetrix exon array data. The simplicity, flexibility and brevity of easyExon make it a valuable tool for AS event identification in genomic research.
PMCID: PMC2579307  PMID: 18851762
10.  GEM-TREND: a web tool for gene expression data mining toward relevant network discovery 
BMC Genomics  2009;10:411.
DNA microarray technology provides a first step toward the goal of uncovering gene functions on a genomic scale. In recent years, vast amounts of gene expression data have been collected, much of which is available in public databases such as the Gene Expression Omnibus (GEO). To date, most researchers have retrieved data from these databases manually through web browsers, using accession numbers (IDs) or keywords, so gene-expression patterns are not considered during retrieval. The Connectivity Map was recently introduced to compare gene expression data by means of gene-expression signatures (sets of genes labeled as up- or down-regulated according to their biological states) and is available as a web tool for detecting similar gene-expression signatures in a limited data set (approximately 7,000 expression profiles representing 1,309 compounds). To help researchers use public gene expression data more effectively, we developed a web tool for finding similar gene expression data and generating co-expression networks from a publicly available database.
GEM-TREND, a web tool for searching gene expression data, allows users to search GEO using gene-expression signatures or gene expression ratio data as a query, and retrieves gene expression data by comparing gene-expression patterns between the query and GEO gene expression data. The comparison methods are based on the nonparametric, rank-based pattern matching approach of Lamb et al. (Science 2006), with the additional calculation of statistical significance. The web tool was tested using gene expression ratio data randomly extracted from GEO and using in-house microarray data. The results validated the ability of GEM-TREND to retrieve gene expression entries from GEO that are biologically related to a query. For further analysis, a network visualization interface is also provided, in which genes and gene annotations are dynamically linked to external data repositories.
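To make the idea of rank-based signature matching concrete, here is a minimal Python sketch. It is a deliberately simplified stand-in and not the statistic used by GEM-TREND or by Lamb et al.: it scores a profile by the difference between the mean normalized rank of the query's up-regulated genes and that of its down-regulated genes, and it omits any significance calculation. The function names, gene identifiers and expression ratios are all hypothetical.

```python
from statistics import mean


def normalized_ranks(profile):
    """Map each gene to a rank in [0, 1]; 1.0 is the most up-regulated gene."""
    ordered = sorted(profile, key=profile.get, reverse=True)
    n = len(ordered)
    if n == 1:
        return {ordered[0]: 0.5}
    return {gene: 1.0 - pos / (n - 1) for pos, gene in enumerate(ordered)}


def signature_score(profile, up_genes, down_genes):
    """Score in [-1, 1]: positive when up genes rank high and down genes rank low.

    Genes absent from the profile are ignored; the score is 0.0 when either
    half of the signature has no gene in the profile.
    """
    ranks = normalized_ranks(profile)
    up = [ranks[g] for g in up_genes if g in ranks]
    down = [ranks[g] for g in down_genes if g in ranks]
    if not up or not down:
        return 0.0
    return mean(up) - mean(down)


# Hypothetical expression ratios (log scale) for one stored profile.
profile = {"a": 2.0, "b": 1.5, "c": 0.1, "d": -1.2, "e": -2.0}

matching = signature_score(profile, up_genes={"a", "b"}, down_genes={"d", "e"})
reversed_sig = signature_score(profile, up_genes={"d", "e"}, down_genes={"a", "b"})
```

Ranking stored profiles by such a score, from most positive to most negative, gives the flavour of signature-based retrieval: concordant profiles appear at the top and reversed ones at the bottom.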
GEM-TREND was developed to retrieve gene expression data by comparing a query gene-expression pattern with those of GEO gene expression data. It could be a very useful resource for finding similar gene expression profiles and constructing gene co-expression networks from a publicly available database. GEM-TREND was designed to be user-friendly and is expected to support knowledge discovery. GEM-TREND is freely available at .
PMCID: PMC2748096  PMID: 19728865
11.  Gene regulation induced in the C57BL/6J mouse retina by hyperoxia: a temporal microarray study 
Molecular Vision  2008;14:1983-1994.
Hyperoxia is specifically toxic to photoreceptors, and this toxicity may be important in the progress of retinal dystrophies. This study examines gene expression induced in the C57BL/6J mouse retina by hyperoxia over the 14-day period during which photoreceptors first resist, then succumb to, hyperoxia.
Young adult C57BL/6J mice were exposed to hyperoxia (75% oxygen) for up to 14 days. On day 0 (control), day 3, day 7, and day 14, retinal RNA was extracted and processed on Affymetrix GeneChip® Mouse Genome 430 2.0 arrays. Microarray data were analyzed using GCOS Version 1.4 and GeneSpring Version 7.3.1. For 15 genes, microarray data were confirmed using relative quantitative real-time reverse transcription polymerase chain reaction techniques.
The overall numbers of hyperoxia-regulated genes increased monotonically with exposure. Within that increase, however, a distinctive temporal pattern was apparent. At 3 days exposure, there was prominent upregulation of genes associated with neuroprotection. By day 14, these early-responsive genes were downregulated, and genes related to cell death were strongly expressed. At day 7, the regulation of these genes was mixed, indicating a possible “transition period” from stability at day 3 to degeneration at day 14. When functional groupings of genes were analyzed separately, there was significant regulation in genes responsive to stress, genes known to cause human photoreceptor dystrophies and genes associated with apoptosis.
Microarray analysis of the response of the retina to prolonged hyperoxia demonstrated a temporal pattern involving early neuroprotection and later cell death, and provided insight into the mechanisms involved in the two phases of response. As hyperoxia is a consistent feature of the late stages of photoreceptor degenerations, understanding the mechanisms of oxygen toxicity may be important therapeutically.
PMCID: PMC2579940  PMID: 18989387
12.  Evaluation of Microarray Performance for Two RNA Amplification Methodologies 
Advances in high-throughput gene expression technologies such as microarrays have transformed our understanding of the molecular mechanisms underlying various types of biological processes and diseases. The Applause™ line of products (3'-Amp, WT-Amp ST, and WT-Amp Plus ST) addresses the requirements of gene expression microarray users with high-quality RNA (50 nanograms) who want a low-cost, single-day, and reliable sample preparation solution. A comparison was made between the Applause line of amplification products and available vendor data for the GeneChip® WT cDNA Synthesis and Amplification and GeneChip 3' IVT Express products. Data generated from 50 ng HeLa, MAQC A, and MAQC B RNA using the Applause 3'-Amp kit were compared to equivalent reported vendor data for the GeneChip 3' IVT Express amplification kit. NuGEN's Applause 3'-Amp outperformed the GeneChip 3' IVT Express, with amplification and labeling completed within 6 hours. It also performed better than the GeneChip 3' IVT Express platform on HGU133 Plus 2.0 by percent Present Calls (%P). Differential expression analysis of MAQC samples demonstrated high correlation between the Applause 3'-Amp array data and data generated by quantitative PCR (qPCR) for the MAQC Project (Nature Biotech, 24:1151-1161 (2006)), with an R value of 0.96. Data generated from 50 ng brain and skeletal muscle RNA with the Applause WT-Amp ST and WT-Amp Plus ST kits were compared to equivalent 1 μg vendor data for the GeneChip WT cDNA Synthesis and Amplification Kit. In addition to the improved speed (9 hours), the Applause WT-Amp kits also showed better QC array metrics (Pos-vs-Neg AUC, All-Probe-Set-Mean, All-Probe-Set-RLE-Mean) than the GeneChip WT cDNA Synthesis and Amplification kit. The Applause amplification system provides researchers with a fast, economical, and high-quality solution to everyday microarray needs.
PMCID: PMC2918120
13.  YersiniaBase: a genomic resource and analysis platform for comparative analysis of Yersinia 
BMC Bioinformatics  2015;16(1):9.
Yersinia is a genus of Gram-negative bacteria that includes serious pathogens such as Yersinia pestis, which causes plague, Yersinia pseudotuberculosis and Yersinia enterocolitica. The remaining species are generally considered non-pathogenic to humans, although there is evidence that at least some of these species can cause occasional infections using mechanisms distinct from those of the more pathogenic species. With the advances in sequencing technologies, many Yersinia genomes have been sequenced. However, there is currently no specialized platform to hold the rapidly growing Yersinia genomic data and to provide analysis tools, particularly for comparative analyses, which are required to provide improved insights into their biology, evolution and pathogenicity.
To facilitate ongoing and future research on Yersinia, especially on the species generally considered non-pathogenic, a well-defined repository and analysis platform is needed to hold the Yersinia genomic data and analysis tools for the Yersinia research community. Hence, we have developed YersiniaBase, a robust and user-friendly Yersinia resource and analysis platform for the analysis of Yersinia genomic data. YersiniaBase has a total of twelve species and 232 genome sequences, of which the majority are Yersinia pestis. In order to smooth the process of searching genomic data in a large database, we implemented an Asynchronous JavaScript and XML (AJAX)-based real-time searching system in YersiniaBase. Besides incorporating existing tools, which include a JavaScript-based genome browser (JBrowse) and the Basic Local Alignment Search Tool (BLAST), YersiniaBase also has in-house developed tools: (1) Pairwise Genome Comparison tool (PGC) for comparing two user-selected genomes; (2) Pathogenomics Profiling Tool (PathoProT) for comparative pathogenomics analysis of Yersinia genomes; (3) YersiniaTree for constructing phylogenetic trees of Yersinia. We ran analyses based on the tools and genomic data in YersiniaBase, and the preliminary results showed differences in virulence genes found in Yersinia pestis and Yersinia pseudotuberculosis compared to other Yersinia species, and differences between Yersinia enterocolitica subsp. enterocolitica and Yersinia enterocolitica subsp. palearctica.
YersiniaBase offers free access to a wide range of genomic data and analysis tools for the analysis of Yersinia. YersiniaBase can be accessed at
Electronic supplementary material
The online version of this article (doi:10.1186/s12859-014-0422-y) contains supplementary material, which is available to authorized users.
PMCID: PMC4384384  PMID: 25591325
YersiniaBase; Yersinia; Genomic resources; Comparative analysis
14.  A statistical method for predicting splice variants between two groups of samples using GeneChip® expression array data 
Alternative splicing of pre-messenger RNA results in RNA variants with combinations of selected exons. It is one of the essential biological functions and regulatory components in higher eukaryotic cells. Some of these variants are detectable with the Affymetrix GeneChip®, which uses multiple oligonucleotide probes (i.e. a probe set), since the target sequences for the multiple probes are adjacent within each gene. Hybridization intensity from a probe correlates with abundance of the corresponding transcript. Although the multiple-probe feature in the current GeneChip® was designed to assess expression values of individual genes, it also measures transcriptional abundance for a sub-region of a gene sequence. This additional capacity motivated us to develop a method to predict alternative splicing, taking advantage of extensive repositories of GeneChip® gene expression array data.
We developed a two-step approach to predict alternative splicing from GeneChip® data. First, we clustered the probes from a probe set into pseudo-exons based on similarity of probe intensities and physical adjacency. A pseudo-exon is defined as a sequence in the gene within which multiple probes have comparable probe intensity values. Second, for each pseudo-exon, we assessed the statistical significance of the difference in probe intensity between two groups of samples. Differentially expressed pseudo-exons are predicted to be alternatively spliced. We applied our method to empirical data generated from GeneChip® Hu6800 arrays, which include 7129 probe sets and twenty probes per probe set. The dataset consists of sixty-nine medulloblastoma (27 metastatic and 42 non-metastatic) samples and four cerebellum samples as normal controls. We predicted that 577 genes would be alternatively spliced when we compared normal cerebellum samples to medulloblastomas, and predicted that thirteen genes would be alternatively spliced when we compared metastatic medulloblastomas to non-metastatic ones. We checked the consistency of some of our findings with information in UCSC Human Genome Browser.
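The two steps described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: adjacent probes are merged when their mean absolute intensity difference across samples is at most `max_dist`, and groups are compared with a Welch t statistic against an arbitrary cutoff `t_cutoff`. The abstract does not specify the similarity measure or the significance test, so both choices, the thresholds and the toy intensities are hypothetical.

```python
from math import copysign, inf, sqrt
from statistics import mean, variance


def cluster_pseudo_exons(samples, max_dist=0.5):
    """Step 1: merge adjacent probes with similar intensities across samples.

    `samples` is a list of per-sample intensity lists, probes in gene order.
    """
    n_probes = len(samples[0]) if samples else 0
    if n_probes == 0:
        return []
    segments = [[0]]
    for i in range(1, n_probes):
        dist = mean([abs(s[i] - s[i - 1]) for s in samples])
        if dist <= max_dist:
            segments[-1].append(i)   # extend the current pseudo-exon
        else:
            segments.append([i])     # start a new pseudo-exon
    return segments


def welch_t(x, y):
    """Welch t statistic; each group needs at least two values."""
    diff = mean(x) - mean(y)
    se = sqrt(variance(x) / len(x) + variance(y) / len(y))
    if se == 0.0:
        return 0.0 if diff == 0 else copysign(inf, diff)
    return diff / se


def differential_pseudo_exons(group_a, group_b, max_dist=0.5, t_cutoff=4.0):
    """Step 2: flag pseudo-exons whose mean intensity differs between groups."""
    segments = cluster_pseudo_exons(group_a + group_b, max_dist)
    flagged = []
    for seg in segments:
        a = [mean([s[i] for i in seg]) for s in group_a]
        b = [mean([s[i] for i in seg]) for s in group_b]
        if abs(welch_t(a, b)) > t_cutoff:
            flagged.append(seg)
    return segments, flagged


# Toy log-scale intensities: 3 samples per group, 5 probes along one gene.
group_a = [[8.0, 8.1, 7.9, 10.0, 10.2],
           [8.2, 8.0, 8.1, 10.1, 9.9],
           [7.9, 8.2, 8.0, 9.8, 10.0]]
group_b = [[8.1, 8.0, 8.2, 6.0, 6.1],
           [8.0, 7.9, 8.1, 6.2, 5.9],
           [8.2, 8.1, 7.9, 5.8, 6.0]]

segments, flagged = differential_pseudo_exons(group_a, group_b)
```

In this toy example the first three probes form one pseudo-exon with similar intensity in both groups, while the last two form a second pseudo-exon that differs strongly between groups and is therefore flagged as a candidate for alternative splicing.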
The two-step approach described in this paper is capable of predicting some alternative splicing from multiple oligonucleotide-based gene expression array data with GeneChip® technology. Our method employs the extensive repositories of gene expression array data available and generates alternative splicing hypotheses, which can be further validated by experimental studies.
PMCID: PMC1502129  PMID: 16603076
15.  Analysis and Interpretation of Multiple Proteomic Datasets: Biologically Relevant Information Obtained in Less than 3 Hours 
A growing issue in proteomics is data interpretation, in particular the time-consuming steps of analyzing proteins and peptides identified by database search engines, extracting biologically meaningful information, and sharing results with co-workers and collaborators. This poster shows the application of a bioinformatics tool that was used to manage protein and peptide lists, put them into a biological context and enable the results to be shared directly. Within 3 hours, output generated by protein database search engines and sourced from multiple proteomic data sets was translated into biological information and shared with collaborators. The bioinformatics tool, ProteinCenter (Proxeon), uses biological annotations from multiple resources to produce a biologically relevant overview in large-scale proteomics studies. ProteinCenter contains public sequence databases to form a comprehensive and consistent superset of 12 million protein sequences derived from over 50 million protein records from GenBank, Refseq, EMBL, UniProt, Swiss-Prot, Trembl, PIR, IPI, PDB, Ensembl etc., including more than 5 million outdated accession numbers. The ProteinCenter database is built using Sun Java technology and Microsoft mySQL database technology for optimal performance. Here we present the bioinformatic analysis and comparison of proteomics datasets derived from the PRIDE database, including an organ-specific proteome map for Arabidopsis thaliana and HUPO projects. Protein identifications were clustered using ProteinCenter algorithms based on indistinguishable proteins or sequence homology.
The results of ProteinCenter data processing will be presented, including statistical analysis of over- and under-represented features such as gene ontology categories, PFAM annotated proteins, signal peptide proteins, trans-membrane annotated proteins, enzymes, involvement in KEGG pathways and others. A bioinformatics tool that gives access to all major protein repositories enabled analysis and interpretation of multiple, large-scale proteomics datasets, sourced directly from search engines, to be performed within 2-3 hours.
PMCID: PMC2918002
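The abstract above mentions statistical analysis of over- and under-represented features but does not name the test used. As a minimal sketch, assuming a one-sided hypergeometric test (a common choice for this kind of enrichment analysis, not necessarily the one ProteinCenter applies), over-representation of a single annotation could be scored as follows. The function name and the example counts are hypothetical.

```python
from math import comb

def hypergeom_upper_tail(population, annotated, sample, hits):
    """Return P(X >= hits) for a hypergeometric draw.

    population: total number of proteins in the background set
    annotated:  background proteins carrying the feature (e.g. a GO category)
    sample:     number of proteins in the identified list
    hits:       identified proteins carrying the feature
    """
    max_hits = min(annotated, sample)
    total = comb(population, sample)
    # Sum the probability mass for every outcome at least as extreme as `hits`.
    favourable = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, max_hits + 1)
    )
    return favourable / total

# Hypothetical counts, invented for illustration only.
p_value = hypergeom_upper_tail(population=10, annotated=4, sample=3, hits=2)
print(round(p_value, 4))
```

A small p-value suggests the feature is over-represented in the identified list; under-representation can be scored with the lower tail in the same way. Real analyses would also correct for testing many categories at once.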
16.  geWorkbench: an open source platform for integrative genomics 
Bioinformatics  2010;26(14):1779-1780.
Summary: geWorkbench (genomics Workbench) is an open source Java desktop application that provides access to an integrated suite of tools for the analysis and visualization of data from a wide range of genomics domains (gene expression, sequence, protein structure and systems biology). More than 70 distinct plug-in modules are currently available implementing both classical analyses (several variants of clustering, classification, homology detection, etc.) as well as state of the art algorithms for the reverse engineering of regulatory networks and for protein structure prediction, among many others. geWorkbench leverages standards-based middleware technologies to provide seamless access to remote data, annotation and computational servers, thus, enabling researchers with limited local resources to benefit from available public infrastructure.
Availability: The project site ( includes links to self-extracting installers for most operating system (OS) platforms as well as instructions for building the application from scratch using the source code [which is freely available from the project's SVN (subversion) repository]. geWorkbench support is available through the end-user and developer forums of the caBIG® Molecular Analysis Tools Knowledge Center,
Supplementary Information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2894520  PMID: 20511363
17.  BEAT: Bioinformatics Exon Array Tool to store, analyze and visualize Affymetrix GeneChip Human Exon Array data from disease experiments 
BMC Bioinformatics  2012;13(Suppl 4):S21.
It is known from recent studies that more than 90% of human multi-exon genes are subject to Alternative Splicing (AS), a key molecular mechanism in which multiple transcripts may be generated from a single gene. It is widely recognized that a breakdown in AS mechanisms plays an important role in cellular differentiation and pathologies. Polymerase Chain Reactions, microarrays and sequencing technologies have been applied to the study of transcript diversity arising from alternative expression. Last generation Affymetrix GeneChip Human Exon 1.0 ST Arrays offer a more detailed view of the gene expression profile providing information on the AS patterns. The exon array technology, with more than five million data points, can detect approximately one million exons, and it allows performing analyses at both gene and exon level. In this paper we describe BEAT, an integrated user-friendly bioinformatics framework to store, analyze and visualize exon arrays datasets. It combines a data warehouse approach with some rigorous statistical methods for assessing the AS of genes involved in diseases. Meta statistics are proposed as a novel approach to explore the analysis results. BEAT is available at
BEAT is a web tool which allows uploading and analyzing exon array datasets using standard statistical methods and an easy-to-use graphical web front-end. BEAT has been tested on a dataset with 173 samples and tuned using new datasets of exon array experiments from 28 colorectal cancer and 26 renal cell cancer samples produced at the Medical Genetics Unit of IRCCS Casa Sollievo della Sofferenza.
To highlight all possible AS events, alternative names, accession Ids, Gene Ontology terms and biochemical pathways annotations are integrated with exon and gene level expression plots. The user can customize the results choosing custom thresholds for the statistical parameters and exploiting the available clinical data of the samples for a multivariate AS analysis.
Despite exon array chips being widely used for transcriptomics studies, there is a lack of analysis tools offering advanced statistical features and requiring no programming knowledge. BEAT provides a user-friendly platform for a comprehensive study of AS events in human diseases, displaying the analysis results with easily interpretable and interactive tables and graphics.
PMCID: PMC3314565  PMID: 22536968
18.  GarlicESTdb: an online database and mining tool for garlic EST sequences 
BMC Plant Biology  2009;9:61.
Allium sativum, commonly known as garlic, is a species in the onion genus (Allium), a large and diverse genus containing over 1,250 species. Its close relatives include chives, onion, leek and shallot. Garlic has been used throughout recorded history for culinary and medicinal purposes and for its health benefits. Interest in garlic is currently increasing because of its nutritional and pharmaceutical value, including its reported effects on high blood pressure and cholesterol, atherosclerosis and cancer. Nevertheless, there are no comprehensive databases available for Expressed Sequence Tags (ESTs) of garlic for gene discovery and future efforts of genome annotation. That is why we developed a new garlic database and applications to enable comprehensive analysis of garlic gene expression.
GarlicESTdb is an integrated database and mining tool for large-scale garlic (Allium sativum) EST sequencing. A total of 21,595 ESTs collected from an in-house cDNA library were used to construct the database. The analysis pipeline is an automated system written in Java and consists of the following components: automatic preprocessing of EST reads, assembly of raw sequences, annotation of the assembled sequences, storage of the analyzed information into MySQL databases, and graphic display of all processed data. A web application was implemented with the latest J2EE (Java 2 Platform Enterprise Edition) software technology (JSP/EJB/JavaServlet) for browsing and querying the database, for creation of dynamic web pages on the client side, and for mapping annotated enzymes to KEGG pathways; the AJAX framework was also used in part. The online resources, such as putative annotation, single nucleotide polymorphisms (SNP) and tandem repeat data sets, can be searched by text, explored on the website, searched using BLAST, and downloaded. To achieve more significant BLAST results, a curation system was introduced with which biologists can easily edit best-hit annotation information for others to view. The GarlicESTdb web application is freely available at .
GarlicESTdb is the first incorporated online information database of EST sequences isolated from garlic that can be freely accessed and downloaded. It has many useful features for interactive mining of EST contigs and datasets from each library, including curation of annotated information, expression profiling, information retrieval, and summary of statistics of functional annotation. Consequently, the development of GarlicESTdb will provide a crucial contribution to biologists for data-mining and more efficient experimental studies.
PMCID: PMC2689220  PMID: 19445732
19.  Novel markers for differentiation of lobular and ductal invasive breast carcinomas by laser microdissection and microarray analysis 
BMC Cancer  2007;7:55.
Invasive ductal and lobular carcinomas (IDC and ILC) are the most common histological types of breast cancer. Clinical follow-up data and metastatic patterns suggest that the development and progression of these tumors are different. The aim of our study was to identify gene expression profiles of IDC and ILC in relation to normal breast epithelial cells.
We examined 30 samples (normal ductal and lobular cells from 10 patients, IDC cells from 5 patients, ILC cells from 5 patients) microdissected from cryosections of ten mastectomy specimens from postmenopausal patients. Fifty nanograms of total RNA were amplified and labeled by PCR and in vitro transcription. Samples were analysed on Affymetrix U133 Plus 2.0 Arrays. The expression of seven differentially expressed genes (CDH1, EMP1, DDR1, DVL1, KRT5, KRT6, KRT17) was verified by immunohistochemistry on tissue microarrays. Expression of ASPN mRNA was validated by in situ hybridization on frozen sections, and CTHRC1, ASPN and COL3A1 were tested by PCR.
Using GCOS pairwise comparison algorithm and rank products we have identified 84 named genes common to ILC versus normal cell types, 74 named genes common to IDC versus normal cell types, 78 named genes differentially expressed between normal ductal and lobular cells, and 28 named genes between IDC and ILC. Genes distinguishing between IDC and ILC are involved in epithelial-mesenchymal transition, TGF-beta and Wnt signaling. These changes were present in both tumor types but appeared to be more prominent in ILC. Immunohistochemistry for several novel markers (EMP1, DVL1, DDR1) distinguished large sets of IDC from ILC.
IDC and ILC can be differentiated both at the gene and protein levels. In this study we report two candidate genes, asporin (ASPN) and collagen triple helix repeat containing 1 (CTHRC1) which might be significant in breast carcinogenesis. Besides E-cadherin, the proteins validated on tissue microarrays (EMP1, DVL1, DDR1) may represent novel immunohistochemical markers helpful in distinguishing between IDC and ILC. Further studies with larger sets of patients are needed to verify the gene expression profiles of various histological types of breast cancer in order to determine molecular subclassifications, prognosis and the optimum treatment strategies.
PMCID: PMC1852112  PMID: 17389037
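The study above uses rank products to select differentially expressed genes. As a rough sketch of the general idea (not the authors' actual pipeline), the rank product of a gene is the geometric mean of its ranks across replicate comparisons, where rank 1 is the strongest up-regulation. The gene labels and fold-change values below are invented, ties are ignored, and the permutation step normally used to assess significance is omitted.

```python
from math import prod

def rank_products(fold_changes):
    """Compute rank products for up-regulation.

    fold_changes maps each gene to a list of log fold changes,
    one value per replicate comparison (all lists the same length).
    """
    genes = list(fold_changes)
    n_reps = len(next(iter(fold_changes.values())))
    ranks = {gene: [] for gene in genes}
    for i in range(n_reps):
        # Rank 1 goes to the largest fold change in this replicate.
        ordered = sorted(genes, key=lambda g: fold_changes[g][i], reverse=True)
        for rank, gene in enumerate(ordered, start=1):
            ranks[gene].append(rank)
    # Geometric mean of the per-replicate ranks.
    return {gene: prod(r) ** (1.0 / n_reps) for gene, r in ranks.items()}

# Hypothetical log fold changes for three genes in two replicate comparisons.
example = {
    "geneA": [2.0, 1.5],
    "geneB": [1.0, 1.8],
    "geneC": [-0.5, 0.1],
}
rp = rank_products(example)
print(sorted(rp, key=rp.get))
```

Genes with the smallest rank product are the most consistently up-regulated; ranking in ascending order instead scores down-regulation.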
20.  KARMA: a web server application for comparing and annotating heterogeneous microarray platforms 
Nucleic Acids Research  2004;32(Web Server issue):W441-W444.
We have developed a universal web server application (KARMA) that allows comparison and annotation of user-defined pairs of microarray platforms based on diverse types of genome annotation data (across different species) collected from multiple sources. The application is an effective tool for diverse microarray platforms, including arrays that are provided by (i) the Keck Microarray Resource at Yale, (ii) commercially available Affymetrix GeneChips® and spotted arrays and (iii) custom arrays made by individual academics. The tool provides a web interface that allows users to input pairs of test files that represent diverse array platforms for either single or multiple species. The program dynamically identifies analogous DNA fragments spotted or synthesized on multiple microarray platforms based on the following types of information: (i) NCBI-Unigene identifiers, if the platforms being compared are within the same species or (ii) NCBI-Homologene data, if they are cross-species. The single-species comparison is implemented based on set operations: intersection, union and difference. Other forms of retrievable annotation data, including LocusLink, SwissProt and Gene Ontology (GO), are collected from multiple remote sites and stored in an integrated fashion using an Oracle database. The KARMA database, which is updated periodically, is available on line at the following URL:
PMCID: PMC441535  PMID: 15215426
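The single-species comparison described for KARMA rests on three set operations over shared identifiers. The following is a minimal sketch of that idea, assuming each platform is available as a mapping from probe or spot ID to a UniGene cluster ID; the function name, the data layout and the identifiers are hypothetical and do not reflect KARMA's actual input format or code.

```python
def compare_platforms(platform_a, platform_b):
    """Compare two array platforms by the UniGene clusters they cover.

    Each platform is a dict mapping a probe/spot ID to a UniGene cluster ID.
    """
    ids_a = set(platform_a.values())
    ids_b = set(platform_b.values())
    return {
        "intersection": ids_a & ids_b,  # clusters present on both platforms
        "union": ids_a | ids_b,         # clusters present on either platform
        "only_a": ids_a - ids_b,        # difference: unique to platform A
        "only_b": ids_b - ids_a,        # difference: unique to platform B
    }

# Hypothetical probe-to-UniGene mappings, invented for illustration.
array_a = {"probe_1": "Hs.1", "probe_2": "Hs.2", "probe_3": "Hs.3"}
array_b = {"spot_9": "Hs.2", "spot_10": "Hs.3", "spot_11": "Hs.4"}

result = compare_platforms(array_a, array_b)
print(sorted(result["intersection"]))
```

For a cross-species comparison, the same operations would apply after both platforms are mapped to a common homology group identifier instead of a UniGene cluster.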
21.  MiMiR – an integrated platform for microarray data sharing, mining and analysis 
BMC Bioinformatics  2008;9:379.
Despite considerable efforts within the microarray community for standardising data format, content and description, microarray technologies present major challenges in managing, sharing, analysing and re-using the large amount of data generated locally or internationally. Additionally, it is recognised that inconsistent and low quality experimental annotation in public data repositories significantly compromises the re-use of microarray data for meta-analysis. MiMiR, the Microarray data Mining Resource was designed to tackle some of these limitations and challenges. Here we present new software components and enhancements to the original infrastructure that increase accessibility, utility and opportunities for large scale mining of experimental and clinical data.
A user friendly Online Annotation Tool allows researchers to submit detailed experimental information via the web at the time of data generation rather than at the time of publication. This ensures the easy access and high accuracy of meta-data collected. Experiments are programmatically built in the MiMiR database from the submitted information and details are systematically curated and further annotated by a team of trained annotators using a new Curation and Annotation Tool. Clinical information can be annotated and coded with a clinical Data Mapping Tool within an appropriate ethical framework. Users can visualise experimental annotation, assess data quality, download and share data via a web-based experiment browser called MiMiR Online. All requests to access data in MiMiR are routed through a sophisticated middleware security layer thereby allowing secure data access and sharing amongst MiMiR registered users prior to publication. Data in MiMiR can be mined and analysed using the integrated EMAAS open source analysis web portal or via export of data and meta-data into Rosetta Resolver data analysis package.
The new MiMiR suite of software enables systematic and effective capture of extensive experimental and clinical information with the highest MIAME score, and secure data sharing prior to publication. MiMiR currently contains more than 150 experiments corresponding to over 3000 hybridisations and supports the Microarray Centre's large microarray user community and two international consortia. The MiMiR flexible and scalable hardware and software architecture enables secure warehousing of thousands of datasets, including clinical studies, from microarray and potentially other -omics technologies.
PMCID: PMC2572073  PMID: 18801157
22.  The MetabolomeExpress Project: enabling web-based processing, analysis and transparent dissemination of GC/MS metabolomics datasets 
BMC Bioinformatics  2010;11:376.
Standardization of analytical approaches and reporting methods via community-wide collaboration can work synergistically with web-tool development to result in rapid community-driven expansion of online data repositories suitable for data mining and meta-analysis. In metabolomics, the inter-laboratory reproducibility of gas-chromatography/mass-spectrometry (GC/MS) makes it an obvious target for such development. While a number of web-tools offer access to datasets and/or tools for raw data processing and statistical analysis, none of these systems are currently set up to act as a public repository by easily accepting, processing and presenting publicly submitted GC/MS metabolomics datasets for public re-analysis.
Here, we present MetabolomeExpress, a new File Transfer Protocol (FTP) server and web-tool for the online storage, processing, visualisation and statistical re-analysis of publicly submitted GC/MS metabolomics datasets. Users may search a quality-controlled database of metabolite response statistics from publicly submitted datasets by a number of parameters (e.g. metabolite, species, organ/biofluid etc.). Users may also perform meta-analysis comparisons of multiple independent experiments or re-analyse public primary datasets via user-friendly tools for t-test, principal components analysis, hierarchical cluster analysis and correlation analysis. They may interact with chromatograms, mass spectra and peak detection results via an integrated raw data viewer. Researchers who register for a free account may upload (via FTP) their own data to the server for online processing via a novel raw data processing pipeline.
MetabolomeExpress provides a new opportunity for the general metabolomics community to transparently present online the raw and processed GC/MS data underlying their metabolomics publications. Transparent sharing of these data will allow researchers to assess data quality and draw their own insights from published metabolomics datasets.
PMCID: PMC2912306  PMID: 20626915
23.  SIGMA: A System for Integrative Genomic Microarray Analysis of Cancer Genomes 
BMC Genomics  2006;7:324.
The prevalence of high resolution profiling of genomes has created a need for the integrative analysis of information generated from multiple methodologies and platforms. Although the majority of data in the public domain are gene expression profiles, and expression analysis software is available, the increase of array CGH studies has enabled integration of high throughput genomic and gene expression datasets. However, tools for direct mining and analysis of array CGH data are limited. Hence, there is a great need for analytical and display software tailored to cross platform integrative analysis of cancer genomes.
We have created a user-friendly Java application to facilitate sophisticated visualization and analysis such as cross-tumor and cross-platform comparisons. To demonstrate the utility of this software, we assembled array CGH data representing Affymetrix SNP chip, Stanford cDNA arrays and whole genome tiling path array platforms for cross comparison. This cancer genome database contains 267 profiles from commonly used cancer cell lines representing 14 different tissue types.
In this study we have developed an application for the visualization and analysis of data from high resolution array CGH platforms that can be adapted for analysis of multiple types of high throughput genomic datasets. Furthermore, we invite researchers using array CGH technology to deposit both their raw and processed data, as this will be a continually expanding database of cancer genomes. This publicly available resource, the System for Integrative Genomic Microarray Analysis (SIGMA) of cancer genomes, can be accessed at .
PMCID: PMC1764892  PMID: 17192189
24.  LegumeGRN: A Gene Regulatory Network Prediction Server for Functional and Comparative Studies 
PLoS ONE  2013;8(7):e67434.
Building accurate gene regulatory networks (GRNs) from high-throughput gene expression data is a long-standing challenge. However, with the emergence of new algorithms combined with the increase of transcriptomic data availability, this goal is now within reach. To help biologists investigate gene regulatory relationships, we developed a web-based computational service to build, analyze and visualize GRNs that govern various biological processes. The web server is preloaded with all available Affymetrix GeneChip-based transcriptomic and annotation data from the three model legume species, i.e., Medicago truncatula, Lotus japonicus and Glycine max. Users can also upload their own transcriptomic and transcription factor datasets from any other species/organisms to analyze their in-house experiments. Users are able to select which experiments, genes and algorithms to use in their GRN analysis. To achieve this flexibility and improve prediction performance, we have implemented multiple mainstream GRN prediction algorithms including co-expression, Graphical Gaussian Models (GGMs), Context Likelihood of Relatedness (CLR), and parallelized versions of TIGRESS and GENIE3. Besides these existing algorithms, we also propose a parallel Bayesian network learning algorithm, which can infer causal relationships (i.e., directionality of interaction) and scale up to several thousands of genes. Moreover, this web server also provides tools to allow integrative and comparative analysis between predicted GRNs obtained from different algorithms or experiments, as well as comparisons between legume species. The web site is available at
PMCID: PMC3701055  PMID: 23844010
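Of the algorithms listed for LegumeGRN, co-expression is the simplest to illustrate. The sketch below assumes the plain form of the method, in which two genes are linked when the absolute Pearson correlation of their expression profiles meets a threshold; the server's actual implementation, threshold and data format are not given in the abstract, and the gene names and values here are invented.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (assumes non-zero variance)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def coexpression_edges(expression, threshold=0.9):
    """Return (gene1, gene2, r) for every gene pair with |r| >= threshold."""
    edges = []
    for g1, g2 in combinations(sorted(expression), 2):
        r = pearson(expression[g1], expression[g2])
        if abs(r) >= threshold:
            edges.append((g1, g2, r))
    return edges

# Hypothetical expression profiles across four experiments.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [1.0, 3.0, 2.0, 4.0],
}
edges = coexpression_edges(profiles, threshold=0.9)
print([(g1, g2) for g1, g2, _ in edges])
```

Edges found this way are undirected; inferring the direction of regulation requires approaches such as the Bayesian network learning mentioned in the abstract.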
25.  Identification of Candidate Genes in Scleroderma-Related Pulmonary Arterial Hypertension 
We hypothesize that pulmonary arterial hypertension (PAH)-associated genes identified by expression profiling of peripheral blood mononuclear cells (PBMCs) from patients with idiopathic pulmonary arterial hypertension (IPAH) can also be identified in PBMCs from scleroderma patients with PAH (PAH-SSc). Gene expression profiles of PBMCs collected from IPAH (n=9), PAH-SSc (n=10) patients and healthy controls (n=5) were generated using HG_U133A_2.0 GeneChips and processed by RMA/GCOS_1.4/SAM_1.21 data analysis pipeline. Disease severity in consecutive patients was assessed by functional status and hemodynamic measurements. The expression profiles were analyzed using PAH severity-stratification, and identified candidate genes were validated with real time PCR (rtPCR). Transcriptomics of PBMCs from IPAH patients was highly comparable with that of PBMCs from PAH-SSc patients. The PBMC gene expression patterns significantly correlate with right atrium pressure (RA) and cardiac index (CI), known predictors of survival in PAH. Array stratification by RA and CI identified 364 PAH-associated candidate genes. Gene ontology analysis revealed significant (Zscore > 1.96) alterations in angiogenesis genes according to PAH severity: MMP9 and VEGF were significantly upregulated in mild as compared to severe PAH and healthy controls, as confirmed by rtPCR. These data demonstrate that PBMCs from patients with PAH-SSc carry distinct transcriptional expression. Furthermore, our findings suggest an association between angiogenesis-related gene expression and severity of PAH in PAH-SSc patients. Deciphering the role of genes involved in vascular remodeling and PAH development may reveal new treatment targets for this devastating disorder.
PMCID: PMC2359723  PMID: 18355767

