The ArrayExpress Archive of Functional Genomics Data (http://www.ebi.ac.uk/arrayexpress) is one of three international functional genomics public data repositories, alongside the Gene Expression Omnibus at NCBI and the DDBJ Omics Archive, supporting peer-reviewed publications. It accepts data generated by sequencing or array-based technologies and currently contains data from almost a million assays, from over 30 000 experiments. The proportion of sequencing-based submissions has grown significantly over the last 2 years and has reached, in 2012, 15% of all new data. All data are available from ArrayExpress in MAGE-TAB format, which allows robust linking to data analysis and visualization tools, including Bioconductor and GenomeSpace. Additionally, R objects, for microarray data, and binary alignment format files, for sequencing data, have been generated for a significant proportion of ArrayExpress data.
Autoimmune diseases are common and debilitating, but their severe manifestations could be reduced if biomarkers were available to allow individual tailoring of the potentially toxic immunosuppressive therapy required for their control. Gene expression-based biomarkers facilitating individual tailoring of chemotherapy in cancer, but not autoimmunity, have been identified and translated into clinical practice1,2. We show that transcriptional profiling of purified CD8 T cells, which avoids the confounding influences of unseparated cells3,4, identifies two distinct patient subgroups predicting long-term prognosis in two different autoimmune diseases: anti-neutrophil cytoplasmic antibody (ANCA)-associated vasculitis (AAV), a chronic, severe disease characterized by inflammation of medium and small blood vessels5, and systemic lupus erythematosus (SLE), characterized by autoantibodies, immune complex deposition and diverse clinical manifestations ranging from glomerulonephritis to neurological dysfunction6. We show that genes defining the poor prognostic group are enriched for genes of the IL7R pathway, TCR signalling and those expressed by memory T cells. Furthermore, the poor prognostic group is associated with an expanded CD8 T cell memory population. These subgroups, which are also found in the normal population and can be identified by measuring expression of only three genes, raise the prospect of individualized therapy and suggest novel potential therapeutic targets in autoimmunity.
Motivation: Meta-analysis of large gene expression datasets obtained from public repositories requires consistently annotated data. Curation of such experiments, however, is an expert activity which involves repetitive manipulation of text. Existing tools for automated curation are few, which creates a bottleneck in the analysis pipeline.
Results: We present MageComet, a web application for biologists and annotators that facilitates the re-annotation of gene expression experiments in MAGE-TAB format. It incorporates data mining, automatic annotation, use of ontologies and data validation to improve the consistency and quality of experimental meta-data from the ArrayExpress Repository.
Availability and implementation: Source and tutorials for MageComet are openly available at goo.gl/8LQPR under the GNU GPL v3 license. An implementation can be found at goo.gl/IdCuA.
Contact: firstname.lastname@example.org or email@example.com
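One kind of inconsistency that re-annotation tools of this type aim to catch can be illustrated with a minimal sketch. It is not MageComet's implementation: the function name `find_inconsistent_values` and the example data are hypothetical, and the only assumption about MAGE-TAB is that the sample table (SDRF) is tab-delimited with a header row. The sketch groups values in one column that differ only by letter case or spacing.

```python
import csv
import io
from collections import defaultdict


def find_inconsistent_values(sdrf_text, column):
    """Group values in one SDRF column that differ only by case or spacing.

    Hypothetical helper for illustration; not part of MageComet.
    """
    reader = csv.DictReader(io.StringIO(sdrf_text), delimiter="\t")
    variants = defaultdict(set)
    for row in reader:
        value = row.get(column) or ""
        # Normalise case and collapse runs of whitespace.
        key = " ".join(value.lower().split())
        if key:
            variants[key].add(value)
    # Keep only normalised keys that appear under more than one spelling.
    return {k: sorted(v) for k, v in variants.items() if len(v) > 1}


# Invented example table; real SDRF files carry many more columns.
EXAMPLE_SDRF = (
    "Source Name\tCharacteristics[organism part]\n"
    "sample 1\tliver\n"
    "sample 2\tLiver\n"
    "sample 3\tkidney\n"
    "sample 4\tliver \n"
)

if __name__ == "__main__":
    print(find_inconsistent_values(EXAMPLE_SDRF, "Characteristics[organism part]"))
```

A curator would then choose one spelling per group, ideally an ontology term label, and apply it to every affected row.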
RNA polymerase III (pol III) transcription of transfer RNA (tRNA) genes is essential for generating the tRNA adapter molecules that link genetic sequence and protein translation. By mapping pol III occupancy genome-wide in the livers of mouse, rat, human, macaque, dog and opossum, we found that pol III binding to individual tRNA genes varies substantially in strength and location. However, taking into account tRNA redundancies by grouping pol III occupancy into 46 anticodon isoacceptor families or 21 amino acid-based isotype classes shows strong conservation. Similarly, pol III occupancy of amino-acid isotypes is almost invariant among transcriptionally and evolutionarily diverse tissues in mouse. Thus, synthesis of functional tRNA isotypes has been highly constrained, though the usage of individual tRNA genes has evolved rapidly.
Motivation: Genome-wide measurement of transcript levels is a ubiquitous tool in biomedical research. As experimental data continues to be deposited in public databases, it is becoming important to develop search engines that enable the retrieval of relevant studies given a query study. While retrieval systems based on meta-data already exist, data-driven approaches that retrieve studies based on similarities in the expression data itself have a greater potential of uncovering novel biological insights.
Results: We propose an information retrieval method based on differential expression. Our method deals with arbitrary experimental designs and performs competitively with alternative approaches, while making the search results interpretable in terms of differential expression patterns. We show that our model yields meaningful connections between biological conditions from different studies. Finally, we validate a previously unknown connection between malignant pleural mesothelioma and SIM2s suggested by our method, via real-time polymerase chain reaction in an independent set of mesothelioma samples.
Availability: Supplementary data and source code are available from http://www.ebi.ac.uk/fg/research/rex.
Supplementary Information: Supplementary data are available at Bioinformatics online.
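The idea of retrieving studies by their differential expression patterns, as described in the abstract above, can be illustrated with a deliberately simplified sketch. It does not reproduce the authors' model: plain cosine similarity between log fold-change vectors is swapped in as a stand-in, and the function names, gene labels and study identifiers are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_studies(query, corpus):
    """Rank studies by similarity of differential expression to a query.

    query:  dict mapping gene -> log fold change
    corpus: dict mapping study id -> dict of gene -> log fold change
    Only genes measured in both profiles are compared.
    """
    results = []
    for study_id, profile in corpus.items():
        genes = sorted(set(query) & set(profile))
        score = cosine([query[g] for g in genes], [profile[g] for g in genes])
        results.append((study_id, score))
    return sorted(results, key=lambda item: item[1], reverse=True)


# Invented toy profiles.
QUERY = {"A": 2.0, "B": -1.0, "C": 0.0}
CORPUS = {
    "study_1": {"A": 1.0, "B": -0.5, "C": 0.0},   # same direction as the query
    "study_2": {"A": -2.0, "B": 1.0, "C": 0.0},   # opposite direction
    "study_3": {"A": 0.0, "B": 0.0, "C": 3.0},    # unrelated genes change
}

if __name__ == "__main__":
    for study_id, score in rank_studies(QUERY, CORPUS):
        print(study_id, round(score, 3))
```

Because the score is computed from per-gene fold changes, a high-ranking hit can be explained by listing the genes that change in the same direction in both studies, which is the kind of interpretability the abstract emphasizes.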
The BioSample Database (http://www.ebi.ac.uk/biosamples) is a new database at EBI that stores information about biological samples used in molecular experiments, such as sequencing, gene expression or proteomics. The goals of the BioSample Database include: (i) recording and linking of sample information consistently within EBI databases such as ENA, ArrayExpress and PRIDE; (ii) minimizing data entry efforts for EBI database submitters by allowing sample descriptions to be submitted once and referenced later in data submissions to assay databases; and (iii) supporting cross-database queries by sample characteristics. Each sample in the database is assigned an accession number. The database includes a growing set of reference samples, such as cell lines, which are repeatedly used in experiments and can be easily referenced from any database by their accession numbers. Accession numbers for the reference samples will be exchanged with a similar database at NCBI. The samples in the database can be queried by their attributes, such as sample types, disease names or sample providers. A simple tab-delimited format facilitates submissions of sample information to the database, initially via email to firstname.lastname@example.org.
Gene Expression Atlas (http://www.ebi.ac.uk/gxa) is an added-value database providing information about gene expression in different cell types, organism parts, developmental stages, disease states, sample treatments and other biological/experimental conditions. The content of this database derives from curation, re-annotation and statistical analysis of selected data from the ArrayExpress Archive and the European Nucleotide Archive. A simple interface allows the user to query for differential gene expression either by gene names or attributes or by biological conditions, e.g. diseases, organism parts or cell types. Since our previous report we have made 20 monthly releases and, as of Release 11.08 (August 2011), the database supports 19 species and contains expression data measured for 19 014 biological conditions in 136 551 assays from 5598 independent studies.
Summary: We present an R-based pipeline, ArrayExpressHTS, for pre-processing, expression estimation and data quality assessment of high-throughput sequencing transcriptional profiling (RNA-seq) datasets. The pipeline starts from raw sequence files and produces standard Bioconductor R objects containing gene or transcript measurements for downstream analysis, along with web reports for data quality assessment. It may be run locally on a user's own computer or remotely on a distributed R-cloud farm at the European Bioinformatics Institute. It can be used to analyse a user's own datasets or public RNA-seq datasets from the ArrayExpress Archive.
Availability: The R package is available at www.ebi.ac.uk/tools/rcloud with online documentation at www.ebi.ac.uk/Tools/rwiki/, also available as supplementary material.
Supplementary information: Supplementary data are available at Bioinformatics online.
It is widely accepted that orthologous genes between species are conserved at the sequence level and perform similar functions in different organisms. However, the level of conservation of gene expression patterns of the orthologous genes in different species has been unclear. To address the issue, we compared gene expression of orthologous genes based on 2,557 human and 1,267 mouse samples with high quality gene expression data, selected from experiments stored in the public microarray repository ArrayExpress.
In a principal component analysis (PCA) of combined data from human and mouse samples merged on orthologous probesets, samples largely form distinctive clusters based on their tissue sources when projected onto the top principal components. The most prominent groups are the nervous system, muscle/heart tissues, liver and cell lines. Despite the great differences in sample characteristics and experiment conditions, the overall patterns of these prominent clusters are strikingly similar for human and mouse. We further analyzed data for each tissue separately and found that the most variable genes in each tissue are highly enriched with human-mouse tissue-specific orthologs and the least variable genes in each tissue are enriched with human-mouse housekeeping orthologs.
The results indicate that the global patterns of tissue-specific expression of orthologous genes are conserved in human and mouse. The expression of groups of orthologous genes co-varies in the two species, both for the most variable genes and the most ubiquitously expressed genes.
Summary: The Sample avAILability system (SAIL) is a web-based application for searching, browsing and annotating biological sample collections or biobank entries. By providing individual-level information on the availability of specific data types (phenotypes, genetic or genomic data) and samples within a collection, rather than the actual measurement data, resource integration can be facilitated. A flexible data structure enables the collection owners to provide descriptive information on their samples using existing or custom vocabularies. Users can query for the available samples by various parameters, combining them via logical expressions. The system can be scaled to hold data from millions of samples with thousands of variables.
Availability: SAIL is available under the Affero GPL open source license: https://github.com/sail.
Contact: email@example.com, firstname.lastname@example.org
Supplementary information: Supplementary data are available at Bioinformatics online and from http://www.simbioms.org.
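The availability-based querying described in the SAIL abstract above can be sketched as follows. This is an illustration only, not SAIL's data model or API: each individual is represented simply by the set of variables for which data or samples exist, and the helper names (`has`, `all_of`, `any_of`, `not_`, `query`) and the example collection are hypothetical.

```python
def has(variable):
    """Predicate: data for `variable` is available for this individual."""
    return lambda available: variable in available


def all_of(*predicates):
    """Logical AND of several predicates."""
    return lambda available: all(p(available) for p in predicates)


def any_of(*predicates):
    """Logical OR of several predicates."""
    return lambda available: any(p(available) for p in predicates)


def not_(predicate):
    """Logical NOT of a predicate."""
    return lambda available: not predicate(available)


def query(collection, predicate):
    """Return the ids of individuals whose availability record matches."""
    return sorted(sid for sid, available in collection.items()
                  if predicate(available))


# Invented collection: only availability is stored, never the measurements.
COLLECTION = {
    "ind_1": {"genotype", "bmi", "plasma"},
    "ind_2": {"bmi"},
    "ind_3": {"genotype", "expression"},
    "ind_4": {"genotype", "bmi"},
}

if __name__ == "__main__":
    # genotype AND (bmi OR expression) AND NOT plasma
    wanted = all_of(has("genotype"),
                    any_of(has("bmi"), has("expression")),
                    not_(has("plasma")))
    print(query(COLLECTION, wanted))
```

Storing only the presence or absence of each variable keeps the records small and free of measurement data, which is what allows collections from different owners to be searched together.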
The ArrayExpress Archive (http://www.ebi.ac.uk/arrayexpress) is one of the three international public repositories of functional genomics data supporting publications. It includes data generated by sequencing or array-based technologies. Data are submitted by users and imported directly from the NCBI Gene Expression Omnibus. The ArrayExpress Archive is closely integrated with the Gene Expression Atlas and the sequence databases at the European Bioinformatics Institute. Advanced queries provided via ontology enabled interfaces include queries based on technology and sample attributes such as disease, cell types and anatomy.
The main conclusion is that systems biology approaches can indeed advance cancer research, having already proved successful in a very wide variety of cancer-related areas, and are likely to prove superior to many current research strategies. Major points include:
Systems biology and computational approaches can make important contributions to research and development in key clinical aspects of cancer and of cancer treatment, and should be developed for understanding and application to diagnosis, biomarkers, cancer progression, drug development and treatment strategies.
Development of new measurement technologies is central to successful systems approaches, and should be strongly encouraged. The systems view of disease combined with these new technologies and novel computational tools will over the next 5–20 years lead to medicine that is predictive, personalized, preventive and participatory (P4 medicine).
Major initiatives are in progress to gather extremely wide ranges of data for both somatic and germ-line genetic variations, as well as gene, transcript, protein and metabolite expression profiles that are cancer-relevant. Electronic databases and repositories play a central role to store and analyze these data. These resources need to be developed and sustained.
Understanding cellular pathways is crucial in cancer research, and these pathways need to be considered in the context of the progression of cancer at various stages. At all stages of cancer progression, major areas require modelling via systems and developmental biology methods including immune system reactions, angiogenesis and tumour progression.
A number of mathematical models of an analytical or computational nature have been developed that can give detailed insights into the dynamics of cancer-relevant systems. These models should be further integrated across multiple levels of biological organization in conjunction with analysis of laboratory and clinical data.
Biomarkers represent major tools in determining the presence of cancer, its progression and the responses to treatments. There is a need for sets of high-quality annotated clinical samples, enabling comparisons across different diseases and the quantitative simulation of major pathways leading to biomarker development and analysis of drug effects.
Education is recognized as a key component in the success of any systems biology programme, especially for applications to cancer research. It is recognized that a balance needs to be found between the need to be interdisciplinary and the necessity of having extensive specialist knowledge in particular areas.
A proposal from this workshop is to explore one or more types of cancer over the full scale of their progression, for example glioblastoma or colon cancer. Such an exemplar project would require all the experimental and computational tools available for the generation and analysis of quantitative data over the entire hierarchy of biological information. These tools and approaches could be mobilized to understand, detect and treat cancerous processes and establish methods applicable across a wide range of cancers.
Systems biology; EU-USA workshop; Cancer
Summary: Computational methods in molecular biology will increasingly depend on standards-based annotations that describe biological experiments in an unambiguous manner. Annotare is a software tool that enables biologists to easily annotate their high-throughput experiments, biomaterials and data in a standards-compliant way that facilitates meaningful search and analysis.
Availability and Implementation: Annotare is available from http://code.google.com/p/annotare/ under the terms of the open-source MIT License (http://www.opensource.org/licenses/mit-license.php). It has been tested on both Mac and Windows.
Motivation: Describing biological sample variables with ontologies is complex due to the cross-domain nature of experiments. Ontologies provide annotation solutions; however, for cross-domain investigations, multiple ontologies are needed to represent the data. These are subject to rapid change, are often not interoperable and present complexities that are a barrier to biological resource users.
Results: We present the Experimental Factor Ontology, designed to meet cross-domain, application focused use cases for gene expression data. We describe our methodology and open source tools used to create the ontology. These include tools for creating ontology mappings, ontology views, detecting ontology changes and using ontologies in interfaces to enhance querying. The application of reference ontologies to data is a key problem, and this work presents guidelines on how community ontologies can be presented in an application ontology in a data-driven way.
Supplementary information: Supplementary data are available at Bioinformatics online.
Gene expression studies greatly contribute to our understanding of complex relationships in gene regulatory networks. However, the complexity of array design, production and manipulation is a limiting factor that affects data quality. The use of customized DNA microarrays improves overall data quality in many situations, but only if analysis tools are available for these specifically designed microarrays.
The IronChip Evaluation Package (ICEP) is a collection of Perl utilities and an easy-to-use data evaluation pipeline for the analysis of microarray data with a focus on data quality of custom-designed microarrays. The package has been developed for the statistical and bioinformatic analysis of the custom cDNA microarray IronChip but can be easily adapted for other custom-designed cDNA or oligonucleotide-based microarray platforms. ICEP uses decision tree-based algorithms to assign quality flags and performs robust analysis based on chip design properties regarding multiple repetitions, ratio cut-off, background and negative controls.
ICEP is a stand-alone Windows application to obtain optimal data quality from custom-designed microarrays and is freely available here (see "Additional Files" section) and at: http://www.alice-dsl.net/evgeniy.vainshtein/ICEP/
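The decision tree-based quality flagging mentioned in the ICEP description above can be pictured with a small sketch. ICEP itself is written in Perl and its actual tree, thresholds and flag names are not given here, so everything below is assumed for illustration: the function `quality_flag`, the flag labels and the default cut-offs are hypothetical, and ratios are taken to be log2 values.

```python
from statistics import median


def quality_flag(replicate_ratios, signal, background, negative_control,
                 min_replicates=2, max_spread=0.5, ratio_cutoff=1.0):
    """Assign a quality flag to one gene from its replicate spots.

    All thresholds are illustrative defaults, not ICEP's values.
    replicate_ratios: log2 expression ratios from repeated spots.
    """
    # 1. The spot must be brighter than local background and negative controls.
    if signal <= background or signal <= negative_control:
        return "low_signal"
    # 2. Enough repeated spots must be present on the chip.
    if len(replicate_ratios) < min_replicates:
        return "too_few_replicates"
    # 3. The replicates must agree with each other.
    if max(replicate_ratios) - min(replicate_ratios) > max_spread:
        return "inconsistent"
    # 4. Apply the ratio cut-off to the robust (median) ratio.
    if abs(median(replicate_ratios)) >= ratio_cutoff:
        return "regulated"
    return "unchanged"


if __name__ == "__main__":
    print(quality_flag([1.2, 1.4, 1.3], signal=900, background=100,
                       negative_control=150))
```

Ordering the checks from signal quality to replicate agreement means the ratio cut-off is only applied to measurements that have already passed the design-based controls.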
Finding transcription factor binding sites in regulatory regions of the genome
With genome analysis expanding from the study of genes to the study of gene regulation, 'regulatory genomics' utilizes sequence information, evolution and functional genomics measurements to unravel how regulatory information is encoded in the genome.
The Gene Expression Atlas (http://www.ebi.ac.uk/gxa) is an added-value database providing information about gene expression in different cell types, organism parts, developmental stages, disease states, sample treatments and other biological/experimental conditions. The content of this database derives from curation, re-annotation and statistical analysis of selected data from the ArrayExpress Archive of Functional Genomics Data. A simple interface allows the user to query for differential gene expression either (i) by gene names or attributes such as Gene Ontology terms, or (ii) by biological conditions, e.g. diseases, organism parts or cell types. The gene queries return the conditions where expression has been reported, while condition queries return which genes are reported to be expressed in these conditions. A combination of both query types is possible. The query results are ranked using various statistical measures and by how many independent studies in the database show the particular gene-condition association. Currently, the database contains information about more than 200 000 genes from nine species and almost 4500 biological conditions studied in over 30 000 assays from over 1000 independent studies.
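The ranking of query results by the number of independent supporting studies can be illustrated with a short sketch. The Atlas's actual statistics are not specified in the description above, so this is an assumption-laden toy: the function `rank_conditions`, the record layout and the use of the smallest p-value as a tie-breaker are all hypothetical.

```python
from collections import defaultdict


def rank_conditions(gene, associations):
    """Rank conditions for one gene by how many independent studies report it.

    associations: iterable of (gene, condition, study_id, p_value) records.
    Ties are broken by the smallest p-value seen for the condition
    (an illustrative choice, not the Atlas's documented method).
    """
    studies = defaultdict(set)
    best_p = {}
    for g, condition, study_id, p_value in associations:
        if g != gene:
            continue
        studies[condition].add(study_id)  # a set counts each study once
        best_p[condition] = min(p_value, best_p.get(condition, 1.0))
    ranked = [(c, len(s), best_p[c]) for c, s in studies.items()]
    return sorted(ranked, key=lambda item: (-item[1], item[2]))


# Invented records; study identifiers are placeholders.
RECORDS = [
    ("TP53", "lung cancer", "study_A", 0.01),
    ("TP53", "lung cancer", "study_B", 0.001),
    ("TP53", "liver", "study_C", 0.0005),
    ("BRCA1", "breast cancer", "study_D", 0.02),
    ("TP53", "lung cancer", "study_B", 0.03),  # second assay, same study
]

if __name__ == "__main__":
    for condition, n_studies, p in rank_conditions("TP53", RECORDS):
        print(condition, n_studies, p)
```

Counting distinct studies rather than assays keeps a single large experiment from dominating the ranking.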
The Minimum Information for Biological and Biomedical Investigations (MIBBI) project provides a resource for those exploring the range of extant minimum information checklists and fosters coordinated development of such checklists.
The regulation of the G1- to S-phase transition is critical for cell-cycle progression. This transition is driven by a transient transcriptional wave regulated by transcription factor complexes termed MBF/SBF in yeast and E2F-DP in mammals. Here we apply genomic, genetic, and biochemical approaches to show that the Yox1p homeodomain protein of fission yeast plays a critical role in confining MBF-dependent transcription to the G1/S transition of the cell cycle. The yox1 gene is an MBF target, and Yox1p accumulates and preferentially binds to MBF-regulated promoters, via the MBF components Res2p and Nrm1p, when they are transcriptionally repressed during the cell cycle. Deletion of yox1 results in constitutively high transcription of MBF target genes and loss of their cell cycle–regulated expression, similar to deletion of nrm1. Genome-wide location analyses of Yox1p and the MBF component Cdc10p reveal dozens of genes whose promoters are bound by both factors, including their own genes and histone genes. In addition, Cdc10p shows promiscuous binding to other sites, most notably close to replication origins. This study establishes Yox1p as a new regulatory MBF component in fission yeast, which is transcriptionally induced by MBF and in turn inhibits MBF-dependent transcription. Yox1p may function together with Nrm1p to confine MBF-dependent transcription to the G1/S transition of the cell cycle via negative feedback. Compared to the orthologous budding yeast Yox1p, which indirectly functions in a negative feedback loop for cell-cycle transcription, similarities but also notable differences in the wiring of the regulatory circuits are evident.
Cells proliferate by growth and division, which is supported by different gene groups that are periodically induced at specific times when they are required during the cell cycle. These genes not only need to be induced at the right time but also repressed when they are no longer required; mistakes in gene regulation can lead to problems in cell proliferation and diseases such as cancer. A well-known regulatory complex functions just before cells replicate their DNA to induce genes required for this important transition. We show that in fission yeast this regulatory complex (MBF) induces a gene whose encoded protein (Yox1p) in turn binds to MBF and represses MBF-regulated genes. In the absence of Yox1p, the MBF-regulated genes do not fluctuate during the cell cycle but remain constantly induced. Thus, MBF sets up not only the induction but also the timely repression of its target genes via Yox1p. We also provide a global analysis of all the genes regulated by Yox1p and MBF. Together, our data uncover a new negative control loop, further highlighting the sophistication of gene regulation during the cell cycle, and illustrating regulatory similarities and differences between organisms.
Summary: SIMBioMS is a web-based open source software system for managing data and information in biomedical studies. It provides a solution for the collection, storage, management and retrieval of information about research subjects and biomedical samples, as well as experimental data obtained using a range of high-throughput technologies, including gene expression, genotyping, proteomics and metabonomics. The system can easily be customized and has proven to be successful in several large-scale multi-site collaborative projects. It is compatible with emerging functional genomics data standards and provides data import and export in accepted standard formats. Protocols for transferring data to durable archives at the European Bioinformatics Institute have been implemented.
Availability: The source code, documentation and initialization scripts are available at http://simbioms.org.
Contact: email@example.com; firstname.lastname@example.org
Summary: ArrayExpress is one of the largest public repositories of microarray datasets. R/Bioconductor provides a comprehensive suite of microarray analysis and integrative bioinformatics software. However, easy ways for importing datasets from ArrayExpress into R/Bioconductor have been lacking. Here, we present such a tool that is suitable for both interactive and automated use.
Availability: The ArrayExpress package is available from the Bioconductor project at http://www.bioconductor.org. A user's guide and examples are provided with the package.
Supplementary information: Supplementary data are available at Bioinformatics online.
Motivation: As ArrayExpress and other repositories of genome-wide experiments are reaching a mature size, it is becoming more meaningful to search for related experiments, given a particular study. We introduce methods that allow for the search to be based upon measurement data, instead of the more customary annotation data. The goal is to retrieve experiments in which the same biological processes are activated. This can be due either to experiments targeting the same biological question, or to as yet unknown relationships.
Results: We use a combination of existing and new probabilistic machine learning techniques to extract information about the biological processes differentially activated in each experiment, to retrieve earlier experiments where the same processes are activated and to visualize and interpret the retrieval results. Case studies on a subset of ArrayExpress show that, with a sufficient amount of data, our method indeed finds experiments relevant to particular biological questions. Results can be interpreted in terms of biological processes using the visualization techniques.
Availability: The code is available from http://www.cis.hut.fi/projects/mi/software/ismb09.