During the last decade, gene expression microarrays have become a standard tool for studying a large variety of biological questions [1]. Beginning with the first experiments [2], microarrays have been used for pinpointing disease-specific genes and drug targets [3], uncovering signaling networks [5], and describing cellular processes [6], among many other applications. While the methods for single-experiment analysis are well established and popular [7], the information extracted from a single experiment is constrained by details of experimental design such as conditions and cell types. Integrating data from different experiments widens the spectrum of biological conditions and increases the power to find subtler effects.
Coexpression is one of the central ideas in gene expression analysis. The 'guilt by association' principle states that gene coexpression might indicate shared regulatory mechanisms and roles in related biological processes. The validity of the principle has been demonstrated in several studies, see for example [8]. The idea can be applied to many tasks in computational biology, such as inferring the functions of poorly characterized genes [9], discovering new putative members of metabolic pathways [12], or predicting and validating protein-protein interactions [13]. Many de novo regulatory motif discovery methods use gene expression similarity information as a primary input for identifying co-regulated genes [15]. More recently, gene expression similarity search has been utilized in a pathway reconstruction study [17].
Multi-experiment coexpression analysis can be a labour-intensive and computationally challenging task. The first steps involve collecting suitable datasets, downloading the data, preprocessing, normalization, and gene annotation management. Then methodological and technical questions arise, namely the integration of different datasets, merging of cross-platform data, and handling of ambiguous mappings between genes and probesets. Finally, the sheer size of the targeted data requires efficient computational strategies or caching of pre-calculated results. The complexity of multi-experiment microarray analysis is likely its main limitation, as researchers often lack the time and resources to take on such a task. Consequently, there is a clear need for services that provide coexpression information in an easy and accessible format.
Surprisingly, the resources and tools for finding genes with similar expression profiles in multiple experiments are still rather scarce.
The microarray databases ArrayExpress [18] and Gene Expression Omnibus (GEO) [19] have implemented a data mining layer for finding and analyzing the most relevant datasets, but neither yet provides a comprehensive gene coexpression search over many datasets simultaneously. Gemma is a web-based resource that utilizes a global inference strategy to detect genes that have similar expression profiles in all covered datasets [20]. However, global coexpression analysis is likely to miss similarities that occur in a tissue- or condition-specific manner [21]. SPELL is a resource that puts a strong emphasis on selecting the appropriate datasets for the query [22]. The method identifies the subset of most relevant datasets by analyzing the coexpression of a user-defined list of genes, and uses the subset to find additional genes. Unfortunately, detecting relevant datasets relies on the user's knowledge of genes that are likely to have similar expression profiles. Furthermore, it currently features a relatively small number of datasets, all of them describing yeast.
We have developed the query engine MEM that detects coexpressed genes in large platform-specific microarray collections. The Affymetrix microarray data originates from ArrayExpress and also includes datasets submitted to GEO and automatically uploaded to ArrayExpress. MEM encompasses a variety of conditions, tissues and disease states and incorporates nearly a thousand datasets for both human and mouse, as well as hundreds of datasets for other model organisms.
MEM coexpression search requires two types of input: first, the user types in a gene ID of interest, and second, chooses a collection of relevant datasets. The user may pick the datasets manually by browsing their annotations, or allow MEM to make an automatic selection based on statistical criteria such as gene variability. MEM performs the coexpression analysis individually for each dataset and assembles the final list of similar genes using a novel statistical rank aggregation algorithm. Efficient programming guarantees rapid performance of the computationally intensive real-time analysis, which does not rely on precomputed or indexed data. The results are presented in a highly interactive graphical format with strong emphasis on further data mining. Query results and datasets can be ordered by significance or clustered. The MEM visualization method highlights the datasets with the highest coexpression with the input gene and helps the user distinguish evidence with poor or negative correlation. Datasets are additionally characterized by automatic text analysis of experiment descriptions and represented as word clouds that highlight predominant terms. With MEM we aim to make multi-experiment coexpression analysis accessible to a wider community of researchers.
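The query workflow described above (select datasets by query gene variability, rank genes within each dataset, then combine the per-dataset ranks) can be illustrated with a minimal sketch. It is not the MEM implementation: the function names, the toy expression values, the standard deviation threshold, the use of Pearson correlation as the similarity measure, and the plain mean-rank aggregation are all assumptions made for illustration. In particular, the mean of ranks only stands in for the statistical rank aggregation algorithm mentioned in the text, which is not specified here.

```python
from math import sqrt
from statistics import pstdev


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def select_datasets(datasets, query, min_sd):
    """Keep datasets in which the query gene varies by at least min_sd (assumed criterion)."""
    return {name: data for name, data in datasets.items()
            if query in data and pstdev(data[query]) >= min_sd}


def rank_in_dataset(data, query):
    """Rank every other gene by decreasing correlation with the query (1 = most similar)."""
    scores = {g: pearson(data[query], profile)
              for g, profile in data.items() if g != query}
    ordered = sorted(scores, key=lambda g: (-scores[g], g))
    return {g: i + 1 for i, g in enumerate(ordered)}


def aggregate_ranks(rankings):
    """Stand-in aggregation: order genes by their mean rank across datasets."""
    collected = {}
    for ranking in rankings:
        for gene, rank in ranking.items():
            collected.setdefault(gene, []).append(rank)
    mean_rank = {g: sum(r) / len(r) for g, r in collected.items()}
    return sorted(mean_rank.items(), key=lambda item: (item[1], item[0]))


# Hypothetical toy collection: dataset name -> gene ID -> expression profile.
datasets = {
    "ds1": {"Q": [1, 2, 3, 4], "A": [2, 4, 6, 8], "B": [4, 3, 2, 1], "C": [1, 3, 2, 4]},
    "ds2": {"Q": [5, 1, 4, 2], "A": [10, 2, 9, 3], "B": [1, 5, 2, 4], "C": [2, 2, 3, 1]},
    "ds3": {"Q": [3, 3, 3, 3], "A": [1, 2, 3, 4], "B": [4, 3, 2, 1], "C": [1, 1, 2, 2]},
}

# ds3 is dropped because the query gene "Q" does not vary in it.
selected = select_datasets(datasets, query="Q", min_sd=0.5)
ranking = aggregate_ranks(rank_in_dataset(d, "Q") for d in selected.values())
print(ranking)
```

Each dataset is analyzed separately, so a gene that is strongly coexpressed in only a few relevant datasets is not diluted by pooling all samples into one correlation. A mean of ranks is easy to read but sensitive to outlying datasets, which is one reason a dedicated statistical aggregation method may be preferred in practice.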