We described a strategy for absolute gene expression profiling based on meta-analysis of large-scale microarray data, and introduced the Gene Expression Commons system as a comprehensive discovery platform. The common reference and probeset meta-database provide significant advantages over conventional relative gene profiling. For each microarray, gene expression is measured relative to the common reference, which yields absolute gene expression values that can be compared across many microarray experiments.
The concept of using a common reference to normalize large amounts of Affymetrix data was first proposed and developed by Katz et al. in 2006. Katz et al. used 1,614 diverse biological samples comprising 251 tissue and pathological categories as the common reference. In 2007, Day et al. described the Celsius system, in which a selected ‘quantification pool’ of 50 heterogeneous samples is held constant for all quantification events. However, as we showed by computational simulation, a data pool of this size is not large enough to establish a stable common reference. In our system we used almost all of the publicly available gene expression data (>10,000 arrays) to enhance the stability of the common reference. The drawback of this large-scale common reference strategy is the computational cost of normalization: each time a new microarray is submitted, the data are re-normalized against all of the microarrays in the large-scale common reference, which takes several hours on a dedicated server. Recently, McCall et al. expanded the work of Katz et al. and developed a new strategy called frozen RMA (fRMA). fRMA first computes the probe-specific parameters required for normalization from large-scale publicly available microarray data, then normalizes each additional microarray using those pre-computed values. The fRMA method therefore has significant potential to improve the processing speed of normalization in Gene Expression Commons.
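The idea of pre-computing reference values once and reusing them for each new array can be illustrated with a minimal sketch. This is not the fRMA algorithm itself, which also estimates probe-specific effects and variances; it shows only a simplified "frozen" quantile normalization step. The function names `build_frozen_reference` and `normalize_to_reference` are hypothetical, and the sketch assumes every array has the same number of probes.

```python
def build_frozen_reference(reference_arrays):
    """Compute the rank-wise mean intensity across a pool of reference arrays.

    This is done once; the result is stored ("frozen") and reused.
    Assumes all arrays have the same number of probes.
    """
    n_probes = len(reference_arrays[0])
    sorted_arrays = [sorted(a) for a in reference_arrays]
    n_arrays = len(sorted_arrays)
    return [
        sum(a[rank] for a in sorted_arrays) / n_arrays
        for rank in range(n_probes)
    ]


def normalize_to_reference(new_array, frozen_reference):
    """Quantile-normalize one new array against the frozen reference.

    Each probe receives the reference value at its own rank, so the
    reference pool does not need to be re-processed.
    """
    if len(new_array) != len(frozen_reference):
        raise ValueError("array and reference must have the same length")
    # Probe indices ordered from lowest to highest intensity.
    order = sorted(range(len(new_array)), key=lambda i: new_array[i])
    normalized = [0.0] * len(new_array)
    for rank, probe_index in enumerate(order):
        normalized[probe_index] = frozen_reference[rank]
    return normalized


# Toy data (illustrative values only, not real measurements).
reference = build_frozen_reference([[1, 2, 3], [3, 5, 4]])
normalized = normalize_to_reference([10, 0, 5], reference)
```

In this sketch the cost of adding an array depends only on the size of that array, not on the size of the reference pool, which is the property that makes a frozen approach attractive for a large common reference.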
Oncomine is a database of cancer microarrays that provides a platform for differential expression analyses comparing most major types of cancer with their respective normal tissues, as well as a variety of cancer subtypes. In databases such as Oncomine, Flymine, Ingenuity, EMAAS, MiMiR and many others, microarray data can be imported, queried and visualized for a selected gene across all analyses or for multiple genes in a selected analysis. However, none of these databases provides an analysis platform for absolute gene expression profiling. In 2007, Zilliox et al. introduced a method to compute thresholds that distinguish expressed from unexpressed genes, as part of a system to define tissue-specific ‘gene expression bar codes’, using 1,092 manually curated human samples and 236 mouse samples. They added a web interface in 2011 to obtain a bar code for a particular sample uploaded by the user (http://rafalab.jhsph.edu/barcode/index.php). In this method, the authors used the smallest mode, defined as a local maximum of the estimated density distribution, and the standard deviation estimated from the expression estimates to the left of the smallest mode. By contrast, Gene Expression Commons uses random samples from a large pool of microarray data as the common reference, and sets a threshold that divides the expression of each gene into “low” and “high” values (instead of “present” and “absent”). This threshold is computed by sorting the expression values from low to high, then using the StepMiner algorithm to fit a step function to the data. Low and high values may be more appropriate for finding signature genes that differentiate cell types, since genes are often expressed at a low level in many cell types but at dramatically higher levels in a small number of cell types of interest. We encourage investigators to experiment with the web interfaces of both systems to find out which method is most appropriate for their purposes.
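The thresholding step described above (sort the values, then fit a step function) can be sketched as follows. This is a simplified illustration rather than the StepMiner implementation: it tries every split of the sorted values, keeps the split with the smallest squared error, and, as an assumption of this sketch, places the threshold midway between the means of the two segments. StepMiner's significance testing is omitted. The names `fit_step_threshold` and `classify` are hypothetical.

```python
def fit_step_threshold(values):
    """Fit a single-step function to sorted values and return a threshold.

    Every split point is tried; the best split minimizes the sum of squared
    errors around the two segment means. The threshold is taken as the
    midpoint of those means (an assumption of this sketch).
    """
    xs = sorted(values)
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two expression values")

    best_sse = None
    best_threshold = None
    for split in range(1, n):
        low, high = xs[:split], xs[split:]
        mean_low = sum(low) / len(low)
        mean_high = sum(high) / len(high)
        sse = sum((v - mean_low) ** 2 for v in low)
        sse += sum((v - mean_high) ** 2 for v in high)
        if best_sse is None or sse < best_sse:
            best_sse = sse
            best_threshold = (mean_low + mean_high) / 2
    return best_threshold


def classify(value, threshold):
    """Label one expression value as "high" or "low" for a gene."""
    return "high" if value >= threshold else "low"


# Toy expression values for one gene (illustrative only).
threshold = fit_step_threshold([1, 1, 2, 9, 10, 10])
labels = [classify(v, threshold) for v in [2, 9]]
```

The quadratic loop is kept for readability; running sums would make the fit linear after sorting, which matters when a gene has values from many thousands of arrays.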
Because hematopoiesis is one of the most studied tissue stem cell-based systems, considerable effort has been invested in microarray analysis of hematopoietic cells, especially hematopoietic stem cells (HSCs). However, each study used a different protocol for purifying HSCs and a different cell population as the counterpart for identifying ‘differentially regulated genes’. Thus each result is project-specific and difficult to generalize.
Microarray gene expression profiling studies targeting HSCs based on relative comparison.
Recently, more comprehensive approaches to profiling gene expression in hematopoietic systems have been introduced. BloodExpress collected published microarray data from 37 distinct mouse hematopoietic populations ex vivo or after culture, as conducted by 15 different projects/laboratories. BloodExpress processed each microarray dataset with the MAS5 method, which does not consider differences in the dynamic ranges of probesets, and classified each gene into a binary Present or Absent state. In terms of data processing strategy, BloodExpress’ binarization by MAS5 was a practical first-pass categorization of gene expression. On the other hand, the data integrated into BloodExpress were highly heterogeneous, and the classification of all genes into “present” or “absent” categories is overly simplistic.
In 2008, Heng and Painter proposed an ambitious project named the ‘Immgen Project’ to establish a complete ‘road map’ of gene expression and regulatory networks in all immune cells. This project aims to generate microarray data for over 200 immune cell types using a highly standardized protocol. However, it does not provide absolute gene expression because its arrays are not compared with arrays from other tissues.
To overcome these limitations, we sorted and profiled 39 mouse hematopoietic populations using very strict cell surface criteria and the most modern sorting strategies. All of these data have been loaded onto the Gene Expression Commons and will be made available to the public. Moreover, because of the advantage of the common reference strategy, incorporating additional data from new populations in the future will not detectably change the gene expression readout of existing populations. Thus, it is our belief that the Gene Expression Commons will serve as a common platform for absolute profiling of gene expression in the hematopoietic system.
The Gene Expression Commons has many other potential uses. For example, one can enter the name of a gene, and rapidly determine the quantitative expression of that gene in each cell type. Alternatively, one can query any cell type within a model to obtain a list of genes expressed exclusively in that cell type, or concomitantly with a defined subset of other cell types. This could be done in mice, where mutant and lineage tracing strains exist to identify candidate genes that may be important in cellular differentiation. Another possible use is for pharmacology, where the expression of drug targets and potential toxicities to hematopoietic stem and progenitor cells can be predicted.
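The second kind of query, listing genes expressed exclusively in a chosen set of cell types, can be sketched as below. This assumes each gene already has a “high” or “low” call per population. The data layout, the function name `genes_high_only_in`, and the gene and population labels are all hypothetical placeholders, not the platform's actual interface or results.

```python
def genes_high_only_in(calls, target_populations):
    """Return genes called "high" in exactly the target populations.

    `calls` maps gene -> {population: "high" or "low"}. A gene qualifies
    only if it is "high" in every target and "low" everywhere else.
    """
    targets = set(target_populations)
    hits = []
    for gene, by_population in calls.items():
        high_in = {
            population
            for population, call in by_population.items()
            if call == "high"
        }
        if high_in == targets:
            hits.append(gene)
    return sorted(hits)


# Placeholder calls (hypothetical genes and populations, not real data).
calls = {
    "GeneA": {"PopX": "high", "PopY": "low", "PopZ": "low"},
    "GeneB": {"PopX": "high", "PopY": "high", "PopZ": "low"},
    "GeneC": {"PopX": "low", "PopY": "low", "PopZ": "low"},
}
only_x = genes_high_only_in(calls, ["PopX"])
x_and_y = genes_high_only_in(calls, ["PopX", "PopY"])
```

Passing a single population gives genes exclusive to that cell type, while passing several gives genes shared by that subset and absent elsewhere.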
Here we demonstrate that absolute gene expression profiling can be achieved by establishing a large common reference dataset and applying meta-analysis. This strategy advances gene expression analysis beyond conventional profiling with small numbers of samples. Additionally, this strategy is applicable to other high-throughput assay platforms, including exon arrays, microRNA arrays, and DNA methylation arrays. The strategy is implemented in a web-based open platform termed “Gene Expression Commons” (https://gexc.stanford.edu/).