The extraction of information from data generated by high-throughput experiments in genomics and proteomics has been likened to "attempting to drink from a fire hose". We are flooded with information on many levels such as whole genome DNA sequences, RNA expression, protein-protein interactions, protein modifications, and more. All this information is accessible in very different formats, ranging from well-organized curated gene sequences to unstructured free text in scientific literature. A system that can manage, link and query these heterogeneous types of datasets is therefore extremely valuable. The Sequence Retrieval System (SRS) is such a unified database system in which numerous different scientific databases have already been integrated [1].
Of special interest are data from high-throughput RNA expression microarrays [2
]. Many of these datasets are freely available and, like information stored in other scientific databases, are from different platforms [4
]. Integrating and mining these databases not only facilitates the analysis of genes of interest but will also support the discovery of disease markers, drug targets and new knowledge in general [6
]. One such platform is Oncomine, which has integrated many different microarray datasets, focussing on human cancer [10
]. Additionally, standardized microarray depositories such as GEO (Gene Expression Omnibus) [11
], ArrayExpress [12
], and CIBEX [13
] do or will soon provide options to browse and query the datasets [14
]. No doubt other platforms focussing on the integration of microarray data will be developed. If started from scratch, these initiatives will likely be limited in their direct linkage to other heterogeneous biological databases, because making those connections and programming the single and batch-wise query options is laborious. The universality of SRS and the numerous scientific databases that have already been integrated in it make it a useful platform for integrating microarray databases. Although the SRS interface for querying databases is quite user-friendly, other aspects of working with SRS are not. These include (i) uploading microarray datasets, (ii) database security including setting user access, (iii) linking databases, (iv) generating standard views, and (v) communication with other programs such as statistical and clustering software. The current SRS interface also has a major disadvantage: it is not designed to perform complex calculations on the fly. Any microarray dataset to be uploaded must therefore have all ratio and statistical calculations performed beforehand. For example, once a dataset is in SRS, one cannot change ratios from log10 to log2, or add an extra field per gene by dividing the expression data of all "normal" samples by those of all "cancer" samples. However, software programs that perform calculations, statistical evaluations, clustering, protein domain predictions, homology searches, and more, can communicate with SRS. Interfaces can be generated that retrieve data from SRS, perform the required action and, if desired, store the results in SRS. Alternatively, SRS allows direct integration of programs such as the BLAST and FASTA homology searches and the SRS-EMBOSS (European Molecular Biology Open Software Suite) tools [18
Generating a database in which heterogeneous datasets are integrated is a challenge in itself. However, retrieving statistically meaningful data by comparing datasets from different sources, platforms and designs is particularly difficult [20
]. A fast-growing body of publications on microarray cross-platform comparisons shows mainly that this can be achieved in many different ways [8
]. Statistical evaluations of data within a single dataset that has sufficient technical and biological replicates are better defined and can be implemented per dataset within a database system [27
]. The strategies and applications we discuss here for linking, storing and querying scientific datasets in SRS do not go beyond processed individual datasets and do not include cross-platform dataset integration. We assume that each uploaded dataset consists of high-quality data and has been processed correctly.
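Because SRS does not calculate on the fly, derived fields such as the log2 ratio or a "normal" over "cancer" ratio have to be added to each record before upload. The following is a minimal Python sketch of such a pre-upload step, not part of SRS or of the upload tool described in this paper. The record layout and the field names (`log10_ratio`, `normal`, `cancer`, `log2_ratio`, `normal_over_cancer`) are hypothetical, and summarising each sample group by its arithmetic mean is an assumption, since the text does not specify how the groups are combined.

```python
import math


def log10_to_log2(log10_ratio):
    """Convert a log10 ratio to a log2 ratio: log2(x) = log10(x) / log10(2)."""
    return log10_ratio / math.log10(2)


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def add_precomputed_fields(record):
    """Return a copy of a per-gene record with derived fields added.

    Assumes (hypothetically) that 'normal' and 'cancer' hold linear-scale
    expression values and that each group is summarised by its mean.
    """
    out = dict(record)
    out["log2_ratio"] = log10_to_log2(record["log10_ratio"])
    out["normal_over_cancer"] = mean(record["normal"]) / mean(record["cancer"])
    return out


# Hypothetical record for one gene; the values are illustrative only.
example = {
    "gene": "GENE_A",
    "log10_ratio": math.log10(2),  # a two-fold change on the log10 scale
    "normal": [100.0, 300.0],
    "cancer": [50.0, 150.0],
}

processed = add_precomputed_fields(example)
print(processed["gene"], processed["log2_ratio"], processed["normal_over_cancer"])
```

Running such a step over every record before upload stores both ratio scales and the group ratio as ordinary fields, so they can be queried in SRS without any further computation.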
In this paper we describe strategies to incorporate microarray databases into SRS and provide a database upload tool. Using the program Venn Mapper as an example, we show that stored microarray data can be retrieved automatically from SRS for external statistical evaluation.