The ArrayExpress Archive of Functional Genomics Data is one of the major international repositories for high-throughput functional genomics data. Since 2003 (1), the database has grown to ~15 000 experiments comprising ~425 000 assays. During this period, the technology used to generate functional genomics data has changed from microarray-based experiments to high-throughput sequencing. To address this, we have developed and integrated submission of high-throughput sequencing data with the European Genome-phenome Archive (EGA) (Lappalainen, I. et al., submitted) and the European Nucleotide Archive (ENA) (2). Other important developments are the inclusion of all Gene Expression Omnibus (GEO) array-based data, a new data exchange agreement with GEO (3) for high-throughput sequencing data, and a new advanced query capability supporting ontology-based queries over the entire Archive contents. The European Bioinformatics Institute's Gene Expression Atlas (GXA) (4) is now a separate resource from the Archive and is linked from the ArrayExpress Graphical User Interface.
Support for high-throughput sequencing data
Adapting ArrayExpress to accept and display high-throughput sequencing experiments alongside existing array data is one of the major recent developments. We have worked closely with other resources at the European Bioinformatics Institute, specifically the ENA and the EGA, which archive short-read data for multiple species and potentially identifiable human data, respectively. As outlined in the MINSEQE guidelines (Minimum Information about a high throughput Sequencing Experiment, http://www.mged.org/minseqe), the provision of raw sequence data alone is insufficient to describe comparative experiments such as RNA-Seq; metadata describing the experimental conditions and processed data are necessary to interpret the experiment. This parallels the provision of metadata for microarray-based experiments (in addition to the raw data files, e.g. CEL files); therefore, the MAGE-TAB (5) data representation format is both appropriate and easy to use for describing these experiments.
Submission of high-throughput sequencing data is now supported by the MAGE-TAB template generation system (6). This allows users to generate and complete a taxon-specific tab-delimited template that describes their experiment and to supply related data files by FTP or Aspera. Where raw data are available, these are integrated into the ENA at the point of submission, and both ArrayExpress experiment accessions and ENA identifiers for sequences are returned to the user by ArrayExpress curators. Exceptions to this process are submissions with potentially identifiable human data, e.g. sequences from human patients. These data are submitted directly to the EGA in MAGE-TAB format; raw data are retained by the EGA, and summary-level data that meet ethical requirements for release are released to ArrayExpress.
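As an illustration of what a completed tab-delimited sample sheet might look like to downstream tools, the following minimal Python sketch parses a small table and checks for empty required fields. The column names follow MAGE-TAB conventions (`Source Name`, `Characteristics[...]`, `Comment[...]`), but the specific columns, the sample rows and the placeholder run identifiers are assumptions made for this example; they are not taken from an actual ArrayExpress template, and the helper functions `read_sdrf` and `missing_fields` are hypothetical.

```python
import csv
import io

# Hypothetical, heavily trimmed sample sheet; real templates have many more
# columns. Run identifiers are placeholders, not real ENA accessions.
SDRF_TEXT = (
    "Source Name\tCharacteristics[organism]\t"
    "Characteristics[disease state]\tComment[ENA_RUN]\n"
    "sample 1\tHomo sapiens\tnormal\tRUN_0001\n"
    "sample 2\tHomo sapiens\tinvasive ductal carcinoma\tRUN_0002\n"
)


def read_sdrf(text):
    """Parse tab-delimited text into a list of per-sample dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def missing_fields(rows, required):
    """Return (row_index, column) pairs where a required column is empty or absent."""
    problems = []
    for index, row in enumerate(rows):
        for column in required:
            if not (row.get(column) or "").strip():
                problems.append((index, column))
    return problems


rows = read_sdrf(SDRF_TEXT)
for row in rows:
    print(row["Source Name"], "|", row["Characteristics[disease state]"])

# An empty list means every required column is filled in for every sample.
print(missing_fields(rows, ["Source Name", "Characteristics[organism]"]))
```

Because each row pairs a sample with its annotations and data file references, a check of this kind can be run before files are transferred.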
Sequencing-based experiments in ArrayExpress now have links from the user interface to the corresponding raw data files in the ENA sequence archive, and these links are also provided in the MAGE-TAB. Work is in progress to develop an automated Bioconductor (7) package to identify, extract and reprocess RNA-Seq data for inclusion in the GXA.
Advanced queries and ontology-driven searches
ArrayExpress provides rich metadata for samples and experiments. These are typically provided as free-text name-value pairs, e.g. disease state: invasive ductal carcinoma. To enable semantic queries (for instance, to find all cancer-related data sets even if they were annotated not as ‘cancer’ but as, e.g., ‘leukemia’), we have developed open source software that allows for query expansion based on the Experiment Factor Ontology (EFO) (8). EFO is a data-driven application ontology developed to describe the sample attributes and experimental variables in functional genomics data sets. The new advanced query syntax allows logical, range and ontology-supported queries. For example, the query ‘retrieve all experiments where one or more samples are annotated as cancer, or a subtype of cancer’ returns 21 083 assays without ontology support and 49 729 assays using subsumption queries for known subtypes of cancer. The query results are highlighted in yellow where they match the original input, green where they match synonyms and red where they match child terms. The ontology is displayed as a tree at query time, and users are offered autocomplete options based on its content. Additionally, the interface has been modified so that experiments can be queried by assay type (array/high-throughput sequencing), source (GEO/ArrayExpress) and molecule (DNA/RNA).
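The effect of ontology-based query expansion can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the ArrayExpress implementation: the term hierarchy, the synonym table, the experiment identifiers and the functions `expand` and `search` are all hypothetical, and the real EFO is far larger. The sketch expands a query term to its synonyms and all descendant terms, then labels each hit by the kind of match, mirroring the input/synonym/child distinction described above.

```python
# Toy ontology (hypothetical, not taken from EFO): parent -> direct children.
CHILDREN = {
    "cancer": ["leukemia", "carcinoma"],
    "carcinoma": ["invasive ductal carcinoma"],
}
# Hypothetical synonym table.
SYNONYMS = {"cancer": ["neoplasm"]}

# Hypothetical free-text annotations per experiment.
ANNOTATIONS = {
    "exp-1": ["cancer"],
    "exp-2": ["leukemia"],
    "exp-3": ["invasive ductal carcinoma"],
    "exp-4": ["normal"],
    "exp-5": ["neoplasm"],
}


def expand(term):
    """Map the query term, its synonyms and all descendants to a match kind."""
    kinds = {term: "input"}
    for synonym in SYNONYMS.get(term, []):
        kinds.setdefault(synonym, "synonym")
    stack = list(CHILDREN.get(term, []))
    while stack:  # depth-first walk over the subtree below the query term
        child = stack.pop()
        if child in kinds:
            continue
        kinds[child] = "child"
        stack.extend(CHILDREN.get(child, []))
    return kinds


def search(term, use_ontology=True):
    """Return {experiment_id: match_kind} for experiments matching the query."""
    kinds = expand(term) if use_ontology else {term: "input"}
    hits = {}
    for experiment_id, values in ANNOTATIONS.items():
        matched = [value for value in values if value in kinds]
        if matched:
            hits[experiment_id] = kinds[matched[0]]
    return hits


print(search("cancer", use_ontology=False))
print(search("cancer", use_ontology=True))
```

Without expansion only the experiment annotated literally as ‘cancer’ is found; with expansion the experiments annotated with a synonym or a subtype are retrieved as well, which is the same effect as the increase in assay counts reported above.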
Integration of GEO data
ArrayExpress has been importing selected GEO (3) Data sets (GDS) in order to provide unified queries across public data and for integration with European Bioinformatics Institute databases such as Ensembl (9). All GEO data with GDS and GSE prefixes are now being imported into ArrayExpress. To date, more than 12 000 GEO-derived experiments and associated array designs are available; import of all GEO data will be complete by the end of 2010. Selected GDS are re-annotated, subjected to quality control and integrated into the GXA. A data exchange agreement between GEO and ArrayExpress is now in place for high-throughput sequencing data, and all high-throughput sequencing data submitted to GEO are present in ArrayExpress.
ArrayExpress will be closely integrated with a new BioSample Database at the European Bioinformatics Institute (EBI) (http://www.ebi.ac.uk/biosamples). This database will store the sample descriptions for all samples referenced by any of the EBI databases. Samples can be pre-submitted and will be linked to EBI databases where related data exist. For example, 1000 Genomes, Coriell cell line and HapMap samples have records in the ENA, EGA and ArrayExpress. This new resource is being developed in conjunction with the NCBI, and data exchange is planned.
The replacement of the existing MAGE-OM-centric architecture (11) with a MAGE-TAB-based infrastructure is ongoing, and data migration is underway. This effort will significantly simplify all internal data management tasks and will benefit users through improved data load times, faster issuing of accession numbers, faster data exchange with GEO and improved query interfaces. The existing browse user interface will be maintained, as will current programmatic access and the FTP site structure, to ensure minimum disruption for users.