Related Articles

1.  ArrayExpress—a public database of microarray experiments and gene expression profiles 
Nucleic Acids Research  2006;35(Database issue):D747-D750.
ArrayExpress is a public database for high throughput functional genomics data. ArrayExpress consists of two parts: the ArrayExpress Repository, which is a MIAME supportive public archive of microarray data, and the ArrayExpress Data Warehouse, which is a database of gene expression profiles selected from the repository and consistently re-annotated. Archived experiments can be queried by experiment attributes, such as keywords, species, array platform, authors, journals or accession numbers. Gene expression profiles can be queried by gene names and properties, such as Gene Ontology terms, and can be visualized. ArrayExpress is a rapidly growing database; it currently contains data from >50 000 hybridizations and >1 500 000 individual expression profiles. ArrayExpress supports community standards, including MIAME, MAGE-ML and, more recently, the proposal for a spreadsheet based data exchange format, MAGE-TAB. Availability: .
PMCID: PMC1716725  PMID: 17132828
2.  ArrayExpress update—an archive of microarray and high-throughput sequencing-based functional genomics experiments 
Nucleic Acids Research  2010;39(Database issue):D1002-D1004.
The ArrayExpress Archive ( is one of the three international public repositories of functional genomics data supporting publications. It includes data generated by sequencing or array-based technologies. Data are submitted by users and imported directly from the NCBI Gene Expression Omnibus. The ArrayExpress Archive is closely integrated with the Gene Expression Atlas and the sequence databases at the European Bioinformatics Institute. Advanced queries provided via ontology enabled interfaces include queries based on technology and sample attributes such as disease, cell types and anatomy.
PMCID: PMC3013660  PMID: 21071405
3.  ArrayExpress update—trends in database growth and links to data analysis tools 
Nucleic Acids Research  2012;41(Database issue):D987-D990.
The ArrayExpress Archive of Functional Genomics Data ( is one of three international functional genomics public data repositories, alongside the Gene Expression Omnibus at NCBI and the DDBJ Omics Archive, supporting peer-reviewed publications. It accepts data generated by sequencing or array-based technologies and currently contains data from almost a million assays, from over 30 000 experiments. The proportion of sequencing-based submissions has grown significantly over the last 2 years and has reached, in 2012, 15% of all new data. All data are available from ArrayExpress in MAGE-TAB format, which allows robust linking to data analysis and visualization tools, including Bioconductor and GenomeSpace. Additionally, R objects, for microarray data, and binary alignment format files, for sequencing data, have been generated for a significant proportion of ArrayExpress data.
PMCID: PMC3531147  PMID: 23193272
4.  ArrayExpress—a public repository for microarray gene expression data at the EBI 
Nucleic Acids Research  2003;31(1):68-71.
ArrayExpress is a new public database of microarray gene expression data at the EBI. It is a generic gene expression database designed to hold data from all microarray platforms. ArrayExpress uses the annotation standard Minimum Information About a Microarray Experiment (MIAME) and the associated XML data exchange format Microarray Gene Expression Markup Language (MAGE-ML), and it is designed to store well annotated data in a structured way. The ArrayExpress infrastructure consists of the database itself, data submissions in MAGE-ML format or via the online submission tool MIAMExpress, an online database query interface, and the Expression Profiler online analysis tool. ArrayExpress accepts three types of submission (arrays, experiments and protocols), each of which is assigned an accession number. Help on data submission and annotation is provided by the curation team. The database can be queried on parameters such as author, laboratory, organism, experiment or array types. With an increasing number of organisations adopting the MAGE-ML standard, the volume of submissions to ArrayExpress is increasing rapidly. The database can be accessed at
PMCID: PMC165538  PMID: 12519949
5.  Gene Expression Atlas at the European Bioinformatics Institute 
Nucleic Acids Research  2009;38(Database issue):D690-D698.
The Gene Expression Atlas ( is an added-value database providing information about gene expression in different cell types, organism parts, developmental stages, disease states, sample treatments and other biological/experimental conditions. The content of this database derives from curation, re-annotation and statistical analysis of selected data from the ArrayExpress Archive of Functional Genomics Data. A simple interface allows the user to query for differential gene expression either (i) by gene names or attributes such as Gene Ontology terms, or (ii) by biological conditions, e.g. diseases, organism parts or cell types. The gene queries return the conditions where expression has been reported, while condition queries return which genes are reported to be expressed in these conditions. A combination of both query types is possible. The query results are ranked using various statistical measures and by how many independent studies in the database show the particular gene-condition association. Currently, the database contains information about more than 200 000 genes from nine species and almost 4500 biological conditions studied in over 30 000 assays from over 1000 independent studies.
PMCID: PMC2808905  PMID: 19906730
6.  A massive human co-expression-network and its medical applications 
Chemistry & biodiversity  2012;9(5):868-887.
Network-based analysis is indispensable in analyzing high throughput biological data. Based on the assumption that the variation of gene interactions under given biological conditions could be better interpreted in the context of a large and varied set of developmental, tissue, and disease conditions, we leverage the large quantity of publicly available transcriptomic data (> 40,000 HG U133A Affymetrix microarray chips) stored in ArrayExpress ( using MetaOmGraph ( From these data, 18,637 chips encompassing over 500 experiments containing high quality data (18637Hu-dataset) were used to create a globally stable gene co-expression network (18637Hu-co-expression-network). Regulons, groups of highly and consistently co-expressed genes, were obtained by partitioning the 18637Hu-co-expression-network using an MCL clustering algorithm. The regulons were demonstrated to be statistically significant using a gene ontology (GO) term overrepresentation test combined with evaluation of the effects of gene permutations. The regulons include approximately 12% of human genes, interconnected by 31,471 correlations. All network data and metadata are publicly available ( Text mining of these metadata, GO term overrepresentation analysis, and statistical analysis of transcriptomic experiments across multiple environmental, tissue, and disease conditions have revealed novel fingerprints distinguishing central nervous system (CNS)-related conditions. This study demonstrates the value of mega-scale network-based analysis for biologists to further refine transcriptomic data derived from a particular condition, to study the global relationships between genes and diseases, and to develop hypotheses that can inform future research.
PMCID: PMC3711686  PMID: 22589089
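The first step described in entry 6, turning expression profiles into a co-expression network, can be illustrated with a minimal sketch. It assumes Pearson correlation and a fixed cutoff of 0.9, neither of which is stated in the abstract; the function names, gene names and values are hypothetical, and the MCL partitioning step is not shown.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, List, Tuple


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation of two equally long expression profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (>= 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / sqrt(var_x * var_y)


def coexpression_edges(
    expression: Dict[str, List[float]], threshold: float = 0.9
) -> List[Tuple[str, str, float]]:
    """Return gene pairs whose correlation across chips meets the threshold.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    edges = []
    for gene_1, gene_2 in combinations(sorted(expression), 2):
        r = pearson(expression[gene_1], expression[gene_2])
        if r >= threshold:
            edges.append((gene_1, gene_2, r))
    return edges


# Hypothetical toy data: three genes measured on four chips.
toy_expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 1.0, 3.0, 2.0],
}
toy_edges = coexpression_edges(toy_expression)
```

In this toy input only geneA and geneB rise and fall together, so they form the single edge; a clustering algorithm would then group connected genes into candidate regulons.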
7.  Gene Expression Atlas update—a value-added database of microarray and sequencing-based functional genomics experiments 
Nucleic Acids Research  2011;40(Database issue):D1077-D1081.
Gene Expression Atlas ( is an added-value database providing information about gene expression in different cell types, organism parts, developmental stages, disease states, sample treatments and other biological/experimental conditions. The content of this database derives from curation, re-annotation and statistical analysis of selected data from the ArrayExpress Archive and the European Nucleotide Archive. A simple interface allows the user to query for differential gene expression either by gene names or attributes or by biological conditions, e.g. diseases, organism parts or cell types. Since our previous report we have made 20 monthly releases and, as of Release 11.08 (August 2011), the database supports 19 species and contains expression data measured for 19 014 biological conditions in 136 551 assays from 5598 independent studies.
PMCID: PMC3245177  PMID: 22064864
8.  Gene expression patterns unveil a new level of molecular heterogeneity in colorectal cancer 
The Journal of Pathology  2013;231(1):63-76.
The recognition that colorectal cancer (CRC) is a heterogeneous disease in terms of clinical behaviour and response to therapy translates into an urgent need for robust molecular disease subclassifiers that can explain this heterogeneity beyond current parameters (MSI, KRAS, BRAF). Attempts to fill this gap are emerging. The Cancer Genome Atlas (TCGA) reported two main CRC groups, based on the incidence and spectrum of mutated genes, and another paper reported a subgroup defined by an EMT expression signature. We performed a prior-free analysis of CRC heterogeneity on 1113 CRC gene expression profiles and compared our findings with established molecular determinants and clinical, histopathological and survival data. Unsupervised clustering based on gene modules allowed us to distinguish at least five different gene expression CRC subtypes, which we call surface crypt-like, lower crypt-like, CIMP-H-like, mesenchymal and mixed. A gene set enrichment analysis combined with literature search of gene module members identified distinct biological motifs in different subtypes. The subtypes, which were not derived based on outcome, nonetheless showed differences in prognosis. Known gene copy number variations and mutations in key cancer-associated genes differed between subtypes, but the subtypes provided molecular information beyond that contained in these variables. Morphological features significantly differed between subtypes. The objective existence of the subtypes and their clinical and molecular characteristics were validated in an independent set of 720 CRC expression profiles. Our subtypes provide a novel perspective on the heterogeneity of CRC. The proposed subtypes should be further explored retrospectively on existing clinical trial datasets and, when sufficiently robust, be prospectively assessed for clinical relevance in terms of prognosis and treatment response predictive capacity.
Original microarray data were uploaded to the ArrayExpress database ( under Accession Nos E-MTAB-990 and E-MTAB-1026.
PMCID: PMC3840702  PMID: 23836465
colorectal cancer; histopathology; gene expression; molecular heterogeneity
9.  Expression Atlas update—a database of gene and transcript expression from microarray- and sequencing-based functional genomics experiments 
Nucleic Acids Research  2013;42(Database issue):D926-D932.
Expression Atlas ( is a value-added database providing information about gene, protein and splice variant expression in different cell types, organism parts, developmental stages, diseases and other biological and experimental conditions. The database consists of selected high-quality microarray and RNA-sequencing experiments from ArrayExpress that have been manually curated, annotated with Experimental Factor Ontology terms and processed using standardized microarray and RNA-sequencing analysis methods. The new version of Expression Atlas introduces the concept of ‘baseline’ expression, i.e. gene and splice variant abundance levels in healthy or untreated conditions, such as tissues or cell types. Differential gene expression data benefit from an in-depth curation of experimental intent, resulting in biologically meaningful ‘contrasts’, i.e. instances of differential pairwise comparisons between two sets of biological replicates. Other novel aspects of Expression Atlas are its strict quality control of raw experimental data, up-to-date RNA-sequencing analysis methods, expression data at the level of gene sets, as well as genes and a more powerful search interface designed to maximize the biological value provided to the user.
PMCID: PMC3964963  PMID: 24304889
10.  The BioSample Database (BioSD) at the European Bioinformatics Institute 
Nucleic Acids Research  2011;40(Database issue):D64-D70.
The BioSample Database ( is a new database at EBI that stores information about biological samples used in molecular experiments, such as sequencing, gene expression or proteomics. The goals of the BioSample Database include: (i) recording and linking of sample information consistently within EBI databases such as ENA, ArrayExpress and PRIDE; (ii) minimizing data entry efforts for EBI database submitters by enabling submitting sample descriptions once and referencing them later in data submissions to assay databases and (iii) supporting cross database queries by sample characteristics. Each sample in the database is assigned an accession number. The database includes a growing set of reference samples, such as cell lines, which are repeatedly used in experiments and can be easily referenced from any database by their accession numbers. Accession numbers for the reference samples will be exchanged with a similar database at NCBI. The samples in the database can be queried by their attributes, such as sample types, disease names or sample providers. A simple tab-delimited format facilitates submissions of sample information to the database, initially via email to
PMCID: PMC3245134  PMID: 22096232
11.  Lung response to Bordetella pertussis infection in mice identified by gene-expression profiling 
Immunogenetics  2007;59(7):555-564.
Host genetics determines the course of Bordetella pertussis infection in mice. Previously, we found four loci, Tlr4 and three novel loci, designated Bps 1–3, that are involved in the control of B. pertussis infection. The purpose of the present study was to identify candidate genes that could explain genetic differences in the course of B. pertussis infection, assuming that such genes are differentially regulated upon infection. We, therefore, studied the course of mRNA expression in the lungs after B. pertussis infection. Of the 22,000 genes investigated, 1,841 were significantly differentially expressed with 1,182 genes upregulated and 659 genes downregulated. Upregulated genes were involved in immune-related processes, such as the acute-phase response, antigen presentation, cytokine production, inflammation, and apoptosis, while downregulated genes were mainly involved in nonimmune processes, such as development and muscle contraction. Pathway analysis revealed the involvement of granulocyte function, toll-like receptor signaling pathway, and apoptosis. Nine of the differentially expressed genes were located in Bps-1, 13 were located in Bps-2, and 62 were located in Bps-3. We conclude that B. pertussis infection induces a wide and complex response, which appears to be partly specific for B. pertussis and partly nonspecific. We envisage that these data will be helpful in identifying polymorphic genes that affect the susceptibility and course of B. pertussis infection in humans.
Electronic supplementary material
The online version of this article (doi:10.1007/s00251-007-0227-5) contains supplementary material, which is available to authorized users. Raw and normalized data of the experiment can be accessed at the online database ArrayExpress
PMCID: PMC1914303  PMID: 17487483
Bordetella pertussis; Expression profiles; Gene expression; Immunity; Pathway analysis
12.  The European Bioinformatics Institute's data resources 
Nucleic Acids Research  2003;31(1):43-50.
As the amount of biological data grows, so does the need for biologists to store and access this information in central repositories in a free and unambiguous manner. The European Bioinformatics Institute (EBI) hosts six core databases, which store information on DNA sequences (EMBL-Bank), protein sequences (SWISS-PROT and TrEMBL), protein structure (MSD), whole genomes (Ensembl) and gene expression (ArrayExpress). But just as a cell would be useless if it couldn't transcribe DNA or translate RNA, our resources would be compromised if each existed in isolation. We have therefore developed a range of tools that not only facilitate the deposition and retrieval of biological information, but also allow users to carry out searches that reflect the interconnectedness of biological information. The EBI's databases and tools are all available on our website at
PMCID: PMC165513  PMID: 12519944
13.  Microarray retriever: a web-based tool for searching and large scale retrieval of public microarray data 
Nucleic Acids Research  2008;36(Web Server issue):W327-W331.
The major public microarray repositories Gene Expression Omnibus and ArrayExpress are growing rapidly. This enables meta-analysis studies, in which expression data from multiple individual studies are combined. To facilitate these types of studies, we developed Microarray Retriever for searching and retrieval of data from GEO and ArrayExpress. The tool allows access to the two repositories simultaneously, to search in the repositories using complex queries, to retrieve microarray data for published articles and to download data in one structured archive. The tool is available on the web at:
PMCID: PMC2447788  PMID: 18463138
14.  arrayMap: A Reference Resource for Genomic Copy Number Imbalances in Human Malignancies 
PLoS ONE  2012;7(5):e36944.
The delineation of genomic copy number abnormalities (CNAs) from cancer samples has been instrumental for identification of tumor suppressor genes and oncogenes and has proven useful for clinical marker detection. An increasing number of projects have mapped CNAs using high-resolution microarray based techniques. So far, no single resource provides a global collection of readily accessible oncogenomic array data.
Methodology/Principal Findings
We here present arrayMap, a curated reference database and bioinformatics resource targeting copy number profiling data in human cancer. The arrayMap database provides a platform for meta-analysis and systems level data integration of high-resolution oncogenomic CNA data. To date, the resource incorporates more than 40,000 arrays in 224 cancer types extracted from several resources, including the NCBI’s Gene Expression Omnibus (GEO), EBI’s ArrayExpress (AE), The Cancer Genome Atlas (TCGA), publication supplements and direct submissions. For the majority of the included datasets, probe level and integrated visualization facilitate gene level and genome wide data review. Results from multi-case selections can be connected to downstream data analysis and visualization tools.
To our knowledge, no data source currently provides an extensive collection of high resolution oncogenomic CNA data that could readily be used for genomic feature mining across a representative range of cancer entities. arrayMap represents our effort to provide a long-term platform for oncogenomic CNA data independent of specific platform considerations or specific project dependence. The online database can be accessed at http//
PMCID: PMC3356349  PMID: 22629346
15.  MIMAS 3.0 is a Multiomics Information Management and Annotation System 
BMC Bioinformatics  2009;10:151.
DNA sequence integrity, mRNA concentrations and protein-DNA interactions have been subject to genome-wide analyses based on microarrays with ever increasing efficiency and reliability over the past fifteen years. However, very recently novel technologies for Ultra High-Throughput DNA Sequencing (UHTS) have been harnessed to study these phenomena with unprecedented precision. As a consequence, the extensive bioinformatics environment available for array data management, analysis, interpretation and publication must be extended to include these novel sequencing data types.
MIMAS was originally conceived as a simple, convenient and local Microarray Information Management and Annotation System focused on GeneChips for expression profiling studies. MIMAS 3.0 enables users to manage data from high-density oligonucleotide SNP Chips, expression arrays (both 3'UTR and tiling) and promoter arrays, BeadArrays as well as UHTS data using MIAME-compliant standardized vocabulary. Importantly, researchers can export data in MAGE-TAB format and upload them to the EBI's ArrayExpress certified data repository using a one-step procedure.
We have vastly extended the capability of the system such that it processes the data output of six types of GeneChips (Affymetrix), two different BeadArrays for mRNA and miRNA (Illumina) and the Genome Analyzer (a popular Ultra-High Throughput DNA Sequencer, Illumina), without compromising its flexibility and user-friendliness. MIMAS, appropriately renamed Multiomics Information Management and Annotation System, is currently used by scientists working in approximately 50 academic laboratories and genomics platforms in Switzerland and France. MIMAS 3.0 is freely available via .
PMCID: PMC2694794  PMID: 19450266
16.  The DNA Data Bank of Japan launches a new resource, the DDBJ Omics Archive of functional genomics experiments 
Nucleic Acids Research  2011;40(Database issue):D38-D42.
The DNA Data Bank of Japan (DDBJ; maintains and provides archival, retrieval and analytical resources for biological information. The central DDBJ resource consists of public, open-access nucleotide sequence databases including raw sequence reads, assembly information and functional annotation. Database content is exchanged with EBI and NCBI within the framework of the International Nucleotide Sequence Database Collaboration (INSDC). In 2011, DDBJ launched two new resources: the ‘DDBJ Omics Archive’ (DOR; and BioProject ( DOR is an archival database of functional genomics data generated by microarray and highly parallel new generation sequencers. Data are exchanged between the ArrayExpress at EBI and DOR in the common MAGE-TAB format. BioProject provides an organizational framework to access metadata about research projects and the data from the projects that are deposited into different databases. In this article, we describe major changes and improvements introduced to the DDBJ services, and the launch of two new resources: DOR and BioProject.
PMCID: PMC3244990  PMID: 22110025
17.  The 2011 Nucleic Acids Research Database Issue and the online Molecular Biology Database Collection 
Nucleic Acids Research  2010;39(Database issue):D1-D6.
The current 18th Database Issue of Nucleic Acids Research features descriptions of 96 new and 83 updated online databases covering various areas of molecular biology. It includes two editorials, one that discusses COMBREX, a new exciting project aimed at figuring out the functions of the ‘conserved hypothetical’ proteins, and one concerning BioDBcore, a proposed description of the ‘minimal information about a biological database’. Papers from the members of the International Nucleotide Sequence Database collaboration (INSDC) describe each of the participating databases, DDBJ, ENA and GenBank, principles of data exchange within the collaboration, and the recently established Sequence Read Archive. A testament to the longevity of databases, this issue includes updates on the RNA modification database, Definition of Secondary Structure of Proteins (DSSP) and Homology-derived Secondary Structure of Proteins (HSSP) databases, which have not been featured here in >12 years. There is also a block of papers describing recent progress in protein structure databases, such as Protein DataBank (PDB), PDB in Europe (PDBe), CATH, SUPERFAMILY and others, as well as databases on protein structure modeling, protein–protein interactions and the organization of inter-protein contact sites. Other highlights include updates of the popular gene expression databases, GEO and ArrayExpress, several cancer gene databases and a detailed description of the UK PubMed Central project. The Nucleic Acids Research online Database Collection, available at:, now lists 1330 carefully selected molecular biology databases. The full content of the Database Issue is freely available online at the Nucleic Acids Research web site (
PMCID: PMC3013748  PMID: 21177655
18.  Expression Profiler: next generation—an online platform for analysis of microarray data 
Nucleic Acids Research  2004;32(Web Server issue):W465-W470.
Expression Profiler (EP, is a web-based platform for microarray gene expression and other functional genomics-related data analysis. The new architecture, Expression Profiler: next generation (EP:NG), modularizes the original design and allows individual analysis-task-related components to be developed by different groups and yet still seamlessly to work together and share the same user interface look and feel. Data analysis components for gene expression data preprocessing, missing value imputation, filtering, clustering methods, visualization, significant gene finding, between group analysis and other statistical components are available from the EBI (European Bioinformatics Institute) web site. The web-based design of Expression Profiler supports data sharing and collaborative analysis in a secure environment. Developed tools are integrated with the microarray gene expression database ArrayExpress and form the exploratory analytical front-end to those data. EP:NG is an open-source project, encouraging broad distribution and further extensions from the scientific community.
PMCID: PMC441608  PMID: 15215431
19.  Cross-species queries of large gene expression databases 
Bioinformatics  2010;26(19):2416-2423.
Motivation: Expression databases, including the Gene Expression Omnibus and ArrayExpress, have experienced significant growth over the past decade and now hold hundreds of thousands of arrays from multiple species. Since most drugs are initially tested on model organisms, the ability to compare expression experiments across species may help identify pathways that are activated in a similar way in humans and other organisms. However, while several methods exist for finding co-expressed genes in the same species as a query gene, looking at co-expression of homologs or arbitrary genes in other species is challenging. Unlike sequence, which is static, expression is dynamic and changes between tissues, conditions and time. Thus, to carry out cross-species analysis using these databases, we need methods that can match experiments in one species with experiments in another species.
Results: To facilitate queries in large databases, we developed a new method for comparing expression experiments from different species. We define a distance metric between the ranking of orthologous genes in the two species. We show how to solve an optimization problem for learning the parameters of this function using a training dataset of known similar expression experiments pairs. The function we learn outperforms previous methods and simpler rank comparison methods that have been used in the past for single species analysis. We used our method to compare millions of array pairs from mouse and human expression experiments. The resulting matches can be used to find functionally related genes, to hypothesize about biological response mechanisms and to highlight conditions and diseases that are activating similar pathways in both species.
Availability: Supporting methods, results and a Matlab implementation are available from
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2944203  PMID: 20702396
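Entry 19 compares experiments across species through the ranking of orthologous genes. The sketch below illustrates only the simpler, unlearned kind of rank comparison that the abstract mentions as a baseline: one minus the Spearman rank correlation over ortholog pairs. It is not the learned distance function from the paper, and all names and values are hypothetical.

```python
from typing import Dict, List, Tuple


def ranks(values: List[float]) -> List[float]:
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman_distance(
    expr_a: Dict[str, float],
    expr_b: Dict[str, float],
    orthologs: List[Tuple[str, str]],
) -> float:
    """Return 1 - Spearman correlation over ortholog pairs present in both.

    0 means identical gene ordering, 2 means fully reversed ordering.
    """
    pairs = [(a, b) for a, b in orthologs if a in expr_a and b in expr_b]
    if len(pairs) < 2:
        raise ValueError("need at least two shared ortholog pairs")
    rank_a = ranks([expr_a[a] for a, _ in pairs])
    rank_b = ranks([expr_b[b] for _, b in pairs])
    mean = (len(pairs) + 1) / 2  # mean rank, also valid with ties
    cov = sum((x - mean) * (y - mean) for x, y in zip(rank_a, rank_b))
    var_a = sum((x - mean) ** 2 for x in rank_a)
    var_b = sum((y - mean) ** 2 for y in rank_b)
    if var_a == 0 or var_b == 0:
        raise ValueError("distance is undefined when all values are tied")
    return 1.0 - cov / (var_a * var_b) ** 0.5


# Hypothetical expression values for one human and one mouse experiment.
human = {"TP53": 5.0, "MYC": 9.0, "GAPDH": 12.0}
mouse = {"Trp53": 1.0, "Myc": 2.0, "Gapdh": 3.0}
ortholog_pairs = [("TP53", "Trp53"), ("MYC", "Myc"), ("GAPDH", "Gapdh")]
distance = spearman_distance(human, mouse, ortholog_pairs)
```

Working on ranks rather than raw intensities sidesteps differences in scale between platforms and species, which is one reason rank-based comparisons are a natural starting point here.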
20.  4DXpress: a database for cross-species expression pattern comparisons 
Nucleic Acids Research  2007;36(Database issue):D847-D853.
In the major animal model species like mouse, fish or fly, detailed spatial information on gene expression over time can be acquired through whole mount in situ hybridization experiments. In these species, expression patterns of many genes have been studied and data has been integrated into dedicated model organism databases like ZFIN for zebrafish, MEPD for medaka, BDGP for Drosophila or GXD for mouse. However, a central repository that allows users to query and compare gene expression patterns across different species has not yet been established. Therefore, we have integrated expression patterns for zebrafish, Drosophila, medaka and mouse into a central public repository called 4DXpress (expression database in four dimensions). Users can query anatomy ontology-based expression annotations across species and quickly jump from one gene to the orthologues in other species. Genes are linked to public microarray data in ArrayExpress. We have mapped developmental stages between the species to be able to compare developmental time phases. We store the largest collection of gene expression patterns available to date in an individual resource, reflecting 16 505 annotated genes. 4DXpress will be an invaluable tool for developmental as well as for computational biologists interested in gene regulation and evolution. 4DXpress is available at
PMCID: PMC2238840  PMID: 17916571
21.  Recall and bias of retrieving gene expression microarray datasets through PubMed identifiers 
The ability to locate publicly available gene expression microarray datasets effectively and efficiently facilitates the reuse of these potentially valuable resources. Centralized biomedical databases allow users to query dataset metadata descriptions, but these annotations are often too sparse and diverse to allow complex and accurate queries. In this study we examined the ability of PubMed article identifiers to locate publicly available gene expression microarray datasets, and investigated whether the retrieved datasets were representative of publicly available datasets found through statements of data sharing in the associated research articles.
In a recent article, Ochsner and colleagues identified 397 studies that had generated gene expression microarray data. Their search of the full text of each publication for statements of data sharing revealed 203 publicly available datasets, including 179 in the Gene Expression Omnibus (GEO) or ArrayExpress databases. Our scripted search of GEO and ArrayExpress for PubMed identifiers of the same 397 studies returned 160 datasets, including six not found by the original search for data sharing statements. As a proportion of datasets found by either method, the search for data sharing statements identified 91.4% of the 209 publicly available datasets, compared to 76.6% found by our search for PubMed identifiers. Searching GEO or ArrayExpress alone retrieved 63.2% and 46.9% of all available datasets, respectively. Studies retrieved through PubMed identifiers were representative of all datasets in terms of research theme, technology, size, and impact, though the recall was highest for datasets published by the highest-impact journals.
Searching database entries using PubMed identifiers can identify the majority of publicly available datasets. We urge authors of all datasets to complete the citation fields for their dataset submissions once publication details are known, thereby ensuring their work has maximum visibility and can contribute to subsequent studies.
PMCID: PMC2990274  PMID: 20349403
information retrieval; data sharing; databases; bioinformatics; PubMed; gene expression microarrays
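The recall figure for the PubMed identifier search in entry 21 can be checked with a few lines of arithmetic. The sketch assumes that the 209 publicly available datasets are the 203 found through data sharing statements plus the six found only by the scripted search; the function name is hypothetical.

```python
def recall_percent(found: int, total: int) -> float:
    """Return the share of available datasets retrieved, in percent."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= found <= total:
        raise ValueError("found must lie between 0 and total")
    return round(100.0 * found / total, 1)


# Assumption: datasets found by either method = 203 + 6 additional ones.
total_available = 203 + 6
# Datasets returned by the scripted PubMed identifier search.
pmid_search_found = 160

pmid_recall = recall_percent(pmid_search_found, total_available)
print(f"PubMed identifier search recall: {pmid_recall}%")
```

The result matches the 76.6% reported in the abstract for the search by PubMed identifiers.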
22.  A pipeline for RNA-seq data processing and quality assessment 
Bioinformatics  2011;27(6):867-869.
Summary: We present an R based pipeline, ArrayExpressHTS, for pre-processing, expression estimation and data quality assessment of high-throughput sequencing transcriptional profiling (RNA-seq) datasets. The pipeline starts from raw sequence files and produces standard Bioconductor R objects containing gene or transcript measurements for downstream analysis along with web reports for data quality assessment. It may be run locally on a user's own computer or remotely on a distributed R-cloud farm at the European Bioinformatics Institute. It can be used to analyse user's own datasets or public RNA-seq datasets from the ArrayExpress Archive.
Availability: The R package is available at with online documentation at, also available as supplementary material.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3051320  PMID: 21233166
23.  Probabilistic retrieval and visualization of biologically relevant microarray experiments 
Bioinformatics  2009;25(12):i145-i153.
Motivation: As ArrayExpress and other repositories of genome-wide experiments are reaching a mature size, it is becoming more meaningful to search for related experiments, given a particular study. We introduce methods that allow for the search to be based upon measurement data, instead of the more customary annotation data. The goal is to retrieve experiments in which the same biological processes are activated. This can be due either to experiments targeting the same biological question, or to as yet unknown relationships.
Results: We use a combination of existing and new probabilistic machine learning techniques to extract information about the biological processes differentially activated in each experiment, to retrieve earlier experiments where the same processes are activated and to visualize and interpret the retrieval results. Case studies on a subset of ArrayExpress show that, with a sufficient amount of data, our method indeed finds experiments relevant to particular biological questions. Results can be interpreted in terms of biological processes using the visualization techniques.
Availability: The code is available from
PMCID: PMC2687969  PMID: 19477980
24.  AnnotCompute: annotation-based exploration and meta-analysis of genomics experiments 
The ever-increasing scale of biological data sets, particularly those arising in the context of high-throughput technologies, requires the development of rich data exploration tools. In this article, we present AnnotCompute, an information discovery platform for repositories of functional genomics experiments such as ArrayExpress. Our system leverages semantic annotations of functional genomics experiments with controlled vocabulary and ontology terms, such as those from the MGED Ontology, to compute conceptual dissimilarities between pairs of experiments. These dissimilarities are then used to support two types of exploratory analysis—clustering and query-by-example. We show that our proposed dissimilarity measures correspond to a user's intuition about conceptual dissimilarity, and can be used to support effective query-by-example. We also evaluate the quality of clustering based on these measures. While AnnotCompute can support a richer data exploration experience, its effectiveness is limited in some cases, due to the quality of available annotations. Nonetheless, tools such as AnnotCompute may provide an incentive for richer annotations of experiments. Code is available for download at
Database URL:
PMCID: PMC3244265  PMID: 22190598
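As a rough illustration of annotation-based dissimilarity and query-by-example, the sketch below ranks experiments by how little their annotation term sets overlap. It is an assumption-laden example, not AnnotCompute's method: it substitutes a plain Jaccard dissimilarity for the measures described in the abstract, and the experiment identifiers, annotation terms, and function names are all hypothetical.

```python
def jaccard_dissimilarity(terms_a, terms_b):
    """1 - |A ∩ B| / |A ∪ B|; 0.0 means identical term sets, 1.0 means disjoint."""
    a, b = set(terms_a), set(terms_b)
    union = a | b
    if not union:
        return 0.0  # two empty annotation sets are treated as identical
    return 1.0 - len(a & b) / len(union)


def query_by_example(query_id, experiments):
    """Rank all other experiments by increasing dissimilarity to the query."""
    query_terms = experiments[query_id]
    scored = [
        (other_id, jaccard_dissimilarity(query_terms, terms))
        for other_id, terms in experiments.items()
        if other_id != query_id
    ]
    return sorted(scored, key=lambda pair: pair[1])


# Hypothetical experiments with made-up annotation terms.
experiments = {
    "exp_A": {"Homo sapiens", "liver", "disease state"},
    "exp_B": {"Homo sapiens", "liver", "compound treatment"},
    "exp_C": {"Mus musculus", "brain", "time course"},
}

ranking = query_by_example("exp_A", experiments)
print(ranking)
```

A set-overlap measure like this ignores ontology structure, so two experiments annotated with closely related but non-identical terms look entirely different; it also degrades when annotations are sparse, which is the limitation the abstract notes.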
25.  MIMAS: an innovative tool for network-based high density oligonucleotide microarray data management and annotation 
BMC Bioinformatics  2006;7:190.
The high-density oligonucleotide microarray (GeneChip) is an important tool for molecular biological research aiming at large-scale detection of single nucleotide polymorphisms in DNA and genome-wide analysis of mRNA concentrations. Local array data management solutions are instrumental for efficient processing of the results and for subsequent uploading of data and annotations to a global certified data repository at the EBI (ArrayExpress) or the NCBI (Gene Expression Omnibus).
To facilitate and accelerate annotation of high-throughput expression profiling experiments, the Microarray Information Management and Annotation System (MIMAS) was developed. The system is fully compliant with the Minimal Information About a Microarray Experiment (MIAME) convention. MIMAS provides life scientists with a highly flexible and focused GeneChip data storage and annotation platform essential for subsequent analysis and interpretation of experimental results with clustering and mining tools. The system software can be downloaded for academic use upon request.
MIMAS implements a novel concept for nation-wide GeneChip data management whereby a network of facilities is centered on one data node directly connected to the European certified public microarray data repository located at the EBI. The solution proposed may serve as a prototype approach to array data management between research institutes organized in a consortium.
PMCID: PMC1459208  PMID: 16597336
