DNA microarrays have become the most important large-scale source of experimental genomic information. They are widely used for tissue and disease classification as well as for gene function discovery. Applications of this technology are routinely published in almost all areas of biology and human disease research, with more than 14,000 PubMed citations containing the word 'microarray' published between 1996 and 2007. Even in the early years of microarray experimentation, it was widely recognized that a central repository should be created to house these data, so that other researchers could glean additional information by re-interpreting them, perhaps in different contexts or in relation to new data. Thus, major efforts to house such data were made, namely the Gene Expression Omnibus (GEO) [1] and ArrayExpress (AEX) [2]. These repositories contain more than 82,000 and 50,000 microarray hybridizations, respectively. Primary data are expensive and time consuming to generate. In spite of the high cost, such experiments are rarely fully mined for their information content. Nevertheless, several meta-analyses based on archived data have been reported [3]. These studies demonstrate the benefit of data repositories and show that additional inferences are possible with reanalysis.
Although gene expression microarray technology has been implemented in a variety of formats (spotted cDNAs, spotted column-synthesized oligos, and in situ synthesized oligos), Affymetrix Inc. (Santa Clara, CA, USA) has been the leading commercial supplier of microarrays since 1996. Within the GEO repository, Affymetrix platforms account for 35% of all arrays deposited, but they represent approximately 60% of the genome-scale gene expression data. For instance, the top seven array platforms in GEO, ranked by the number of arrays deposited, are all Affymetrix platforms. Thus, among the data held in public repositories, this platform type forms the richest set of expression information that can be readily combined for meta-analyses spanning multiple experiments. Furthermore, the Affymetrix platform has a standard set of protocols for probe generation and labeling, uses a single-color detection system, and has a relatively reliable array fabrication process. It is also widely applied to a variety of biological problems. This platform is therefore highly attractive as the basis for amalgamation of data from many different sources. In theory, historic arrays can be directly compared with additional experiments and provide an important tool for comparative analyses. However, because many different analytical procedures exist for normalization and quantification of the oligonucleotide-level data, it is greatly preferable to reanalyze primary data in the form of processed image files (termed CEL files, or CELs). This permits substantially more robust comparisons between datasets, because the same analytical metric can be applied to the joint data, and it will ultimately permit more thorough vetting of algorithms that assess gene expression levels from this platform.
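To illustrate why applying one analytical metric to the joint data makes arrays from different sources comparable, the following minimal sketch uses quantile normalization, a commonly used technique for oligonucleotide arrays. It is offered only as an example of a shared metric and is not taken from the text as the method applied to any particular dataset. The function name and the toy intensity values are assumptions made for illustration.

```python
from statistics import mean


def quantile_normalize(arrays):
    """Force every array to share one empirical intensity distribution.

    `arrays` is a list of equal-length lists of probe intensities,
    one inner list per hybridization (illustrative input format).
    """
    if not arrays:
        return []
    n_probes = len(arrays[0])
    if any(len(a) != n_probes for a in arrays):
        raise ValueError("all arrays must have the same number of probes")

    # Reference distribution: mean of the k-th smallest value across arrays.
    sorted_arrays = [sorted(a) for a in arrays]
    reference = [mean(s[k] for s in sorted_arrays) for k in range(n_probes)]

    normalized = []
    for a in arrays:
        # Rank probes within the array; ties are broken by position,
        # which is a simplification of the usual tie handling.
        order = sorted(range(n_probes), key=lambda i: a[i])
        out = [0.0] * n_probes
        for rank, probe_index in enumerate(order):
            out[probe_index] = reference[rank]
        normalized.append(out)
    return normalized


# Toy example: two hybridizations measured on different intensity scales.
lab_a = [5, 2, 3]
lab_b = [4, 1, 6]
normalized_a, normalized_b = quantile_normalize([lab_a, lab_b])
```

After normalization, both arrays contain the same set of values and differ only in which probe carries which value, so a downstream comparison is not driven by differences in overall intensity scale. This is only possible when the primary probe-level data from all sources are processed together.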
Based on the popularity and ease of use of the Affymetrix platform, we began to construct a combined resource for the storage of publicly available CELs for ongoing comparison with data generated at the University of California, Los Angeles (UCLA) DNA Microarray Core Facility as part of the National Institutes of Health Neuroscience Microarray Consortium (NNMC). The purpose of this assembly of CELs was to create a substantial reference set of primary data that would then be available for all ongoing projects. As we examined the available CEL file resources, it became apparent that public data have effectively fragmented into multiple small repositories, despite the presence of two major repository efforts and the deposition requirements of journals. Of the more than 30,000 instances of CELs that were collected from 11 institutional servers (Figure 1), fewer than 5% are present as CELs in either GEO or AEX, the two official public repositories. We estimate that up to 90% of generated CELs are not yet deposited in AEX or GEO. In fact, most public CELs are not easy to find. This suggests that the number of publicly available CELs is much larger than the number used in our study, but these CELs are not accessible using standard bulk-mode data retrieval protocols such as FTP and rsync.
Figure 1. Summary of data sources present in Celsius. Data have been imported from several sources, 11 of which are shown. Numerals indicate the number of files within each source. Circle overlap is proportional to CEL overlap between data sources. AEX, EBI ArrayExpress.
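One simple way to estimate the overlap between data sources is to fingerprint each file by its content and count the fingerprints that two sources share. The sketch below assumes byte-identical files and uses SHA-256 digests. The source names and file contents are hypothetical, and the text does not state that overlap was computed in this way.

```python
import hashlib
from itertools import combinations


def fingerprint(content: bytes) -> str:
    """Return a SHA-256 digest that identifies a file by its bytes."""
    return hashlib.sha256(content).hexdigest()


def pairwise_overlap(sources):
    """Count the files shared by each pair of sources.

    `sources` maps a (hypothetical) source name to an iterable of raw
    file contents; in practice these would be read from CEL files on disk.
    """
    digests = {
        name: {fingerprint(content) for content in files}
        for name, files in sources.items()
    }
    return {
        (a, b): len(digests[a] & digests[b])
        for a, b in combinations(sorted(digests), 2)
    }


# Toy example with placeholder byte strings standing in for CEL contents.
example_sources = {
    "server_a": [b"cel-1", b"cel-2"],
    "server_b": [b"cel-2", b"cel-3"],
    "server_c": [b"cel-4"],
}
overlap = pairwise_overlap(example_sources)
```

A content digest only detects exact copies. The same hybridization saved in a different CEL format version would produce a different digest, so a production system would likely need to compare parsed intensity values instead.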
We further note that inconsistent annotation of experiments impedes meta-analysis. Re-use of these data is compromised by the low quality of the clinically or experimentally relevant metadata actually available for many datasets, as well as by the inconsistent and incomplete implementation of the standards for encoding these metadata [6]. For instance, no repository uses controlled vocabularies, and therefore the annotation of experiments can be ambiguous and difficult to use when integrating datasets.
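The effect of a controlled vocabulary can be shown with a minimal sketch in which free-text sample annotations are mapped onto a small set of preferred terms. The vocabulary, synonyms, and function names below are hypothetical and chosen only for illustration; they do not correspond to any particular ontology or repository.

```python
# Hypothetical controlled vocabulary: preferred term -> accepted synonyms.
CONTROLLED_TERMS = {
    "liver": {"liver", "hepatic", "hepatic tissue"},
    "cerebral cortex": {"cerebral cortex", "brain cortex"},
}


def canonical_form(text):
    """Lower-case the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def build_lookup(vocabulary):
    """Invert the vocabulary into a synonym -> preferred-term mapping."""
    lookup = {}
    for term, synonyms in vocabulary.items():
        for synonym in synonyms:
            lookup[canonical_form(synonym)] = term
    return lookup


def normalize_annotation(text, lookup):
    """Return the preferred term for a free-text label, or None if unknown."""
    return lookup.get(canonical_form(text))


lookup = build_lookup(CONTROLLED_TERMS)
labels = ["Liver", "  Hepatic  tissue", "brain cortex", "unknown sample"]
normalized = [normalize_annotation(label, lookup) for label in labels]
```

Labels that resolve to the same preferred term can be grouped across datasets without manual inspection, whereas unmapped labels return `None` and can be flagged for curation instead of being silently merged.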
Here, we present a community-oriented structure to permit massive amalgamation of microarray data for joint analyses. We have termed this resource 'Celsius', to reflect both the intended community spirit and the restriction to image files generated from the Affymetrix platform. Celsius has four major goals: to import all available Affymetrix primary data, whether published or not, specifically gene expression, genotyping, and tiling CELs; to process imported data using best-of-breed statistical methods made available by the community; to facilitate and encourage community involvement in annotation of deposited samples using controlled vocabularies; and to make available for re-export consistently quantified and normalized data that can be combined without further processing. In this article we describe the methods employed to create this resource, a snapshot of its contents, nascent systematic approaches to annotating samples and genes solely using expression data, and the resource's growth rate.