The ArrayExpress Archive of Functional Genomics Data (1) is one of the major international repositories for high-throughput functional genomics data, supporting publications as well as various data-generating consortia. It stores functional genomics data derived from high-throughput sequencing (HTS) and microarray-based experiments. Users come to ArrayExpress to (i) find functional genomics experiments that might be relevant to their research; (ii) retrieve information describing these experiments and the data associated with them; (iii) retrieve data for inclusion in their own local data warehouses or added-value databases; and (iv) submit their own data supporting a peer-reviewed publication.
Once submitted, data may be kept private in ArrayExpress for a limited period of time, typically during the peer-review process of the related publication. Upon submission, each data set is assigned an accession number, and access to the data is restricted to providers and reviewers via a login system. The submitter specifies the release date, and the data become public either when the accession number associated with the data is cited in a publication or at the set release date, whichever comes first.
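The release rule described above can be illustrated with a minimal sketch. The function name `is_public` and its parameters are hypothetical and are not part of any ArrayExpress interface; the sketch only encodes the "whichever comes first" logic.

```python
from datetime import date
from typing import Optional


def is_public(today: date, release_date: date,
              cited_on: Optional[date] = None) -> bool:
    """Return True once either release trigger has occurred.

    release_date: date set by the submitter.
    cited_on: date the accession number was first cited in a
              publication, or None if it has not been cited yet.
    """
    # The effective release date is the earlier of the two triggers.
    if cited_on is None:
        effective = release_date
    else:
        effective = min(release_date, cited_on)
    return today >= effective
```

For example, a data set with a release date of 1 June that is cited on 1 March would be treated as public from 1 March onwards under this sketch.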
All submissions are automatically checked for compliance with the Minimum Information About a Microarray Experiment (MIAME) (2) or Minimum Information about Sequencing Experiments (MINSEQE; http://www.fged.org/projects/minseqe/) guidelines, for microarray and sequencing-based experiments, respectively. The MIAME/MINSEQE scores associated with an experiment are displayed in the ArrayExpress interface and provided to submitters.
In addition to the data submitted directly to ArrayExpress, data from the Gene Expression Omnibus (GEO) (3) are imported to provide users with a single point of access to most of the functional genomics data available in the public domain. All data are organized, and available for download, in a structured and standardized format, MAGE-TAB (4), which also facilitates linking to open source analysis environments such as Bioconductor (5) and GenomeSpace (http://www.genomespace.org). A format conversion tool, from GEO SOFT to MAGE-TAB (6), is run on all GEO HTS and microarray data. The conversion is successful in 83% of cases. It may fail for various reasons, including failure to parse SOFT files correctly or failure to retrieve the associated data files, and we are constantly working with GEO to increase the success rate. All HTS data are exchanged with GEO and a data sharing agreement with the DDBJ Omics Archive is also in place (7).
For all experiments, the column labels describing the sample (e.g. disease) and its characteristics (e.g. type II diabetes) are mapped to the EBI's Experimental Factor Ontology (EFO) (8) and the data are loaded into ArrayExpress. This allows consistent query results to be returned from direct submissions as well as imported data. As data are curated for Gene Expression Atlas use (9), they are reloaded into ArrayExpress with enriched annotation.
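As an illustration of the column labels and values referred to above, the following minimal sketch reads sample annotation from a small tab-delimited table in the style of a MAGE-TAB sample and data relationship file. The table contents, the sample names and the function `sample_characteristics` are invented for illustration; the sketch assumes that sample attributes appear in columns headed `Characteristics[...]` and does not perform the ontology mapping itself.

```python
import csv
import io
import re
from typing import Dict

# Hypothetical, hand-written example table (tab-delimited).
SDRF_TEXT = (
    "Source Name\tCharacteristics[organism]\tCharacteristics[disease]\n"
    "sample 1\tHomo sapiens\ttype II diabetes\n"
    "sample 2\tHomo sapiens\tnormal\n"
)

# Matches headers such as "Characteristics[disease]" and captures the label.
CHARACTERISTICS = re.compile(r"^Characteristics\s*\[(.+)\]$")


def sample_characteristics(sdrf_text: str) -> Dict[str, Dict[str, str]]:
    """Map each sample name to its {label: value} annotation pairs."""
    reader = csv.DictReader(io.StringIO(sdrf_text), delimiter="\t")
    result: Dict[str, Dict[str, str]] = {}
    for row in reader:
        labels: Dict[str, str] = {}
        for column, value in row.items():
            match = CHARACTERISTICS.match(column)
            if match:
                labels[match.group(1).strip()] = value
        result[row["Source Name"]] = labels
    return result


annotations = sample_characteristics(SDRF_TEXT)
```

Here `annotations["sample 1"]["disease"]` holds the value `type II diabetes`; a label/value pair of this kind is what would subsequently be mapped to an ontology term.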
The ArrayExpress user interface allows users to search for experiments of interest by keywords and ontology terms, which enable semantically driven searches of the experimental metadata; for instance, searching with the EFO term ‘cancer’ will also find experiments investigating ‘leukemia’ even if ‘cancer’ is not mentioned explicitly. Both US and UK spellings are supported.
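The idea behind such an ontology-driven search can be sketched as query expansion over a term hierarchy. The example below is a toy illustration under stated assumptions, not the ArrayExpress implementation: the small hierarchy, the spelling table, the accession-like identifiers and the function names are all hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical fragment of a term hierarchy: parent -> child terms.
TOY_ONTOLOGY: Dict[str, List[str]] = {
    "cancer": ["leukemia", "carcinoma"],
    "leukemia": ["acute myeloid leukemia"],
}

# Hypothetical UK -> US spelling table used to normalize text.
SPELLING_VARIANTS: Dict[str, str] = {
    "leukaemia": "leukemia",
    "tumour": "tumor",
}


def normalize(text: str) -> str:
    """Lower-case the text and replace UK spellings with US ones."""
    text = text.lower()
    for uk, us in SPELLING_VARIANTS.items():
        text = text.replace(uk, us)
    return text


def expand_term(term: str, ontology: Dict[str, List[str]]) -> Set[str]:
    """Return the term together with all of its descendants."""
    start = normalize(term)
    seen = {start}
    stack = [start]
    while stack:
        current = stack.pop()
        for child in ontology.get(current, []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def search(query: str, experiments: Dict[str, str],
           ontology: Dict[str, List[str]] = TOY_ONTOLOGY) -> List[str]:
    """Return identifiers whose description mentions the query term
    or any of its descendant terms."""
    terms = expand_term(query, ontology)
    return [
        accession
        for accession, description in experiments.items()
        if any(term in normalize(description) for term in terms)
    ]


# Invented experiment descriptions keyed by made-up identifiers.
EXPERIMENTS = {
    "E-TOY-1": "Expression profiling of leukaemia samples",
    "E-TOY-2": "Heat shock response in yeast",
    "E-TOY-3": "Transcription profiling of carcinoma cell lines",
}

hits = search("cancer", EXPERIMENTS)
```

In this sketch, searching for ‘cancer’ returns `E-TOY-1` and `E-TOY-3` although neither description contains the word ‘cancer’, and the UK spelling ‘leukaemia’ is matched after normalization.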