|Home | About | Journals | Submit | Contact Us | Français|
Various databases have harnessed the wealth of publicly available microarray data to address biological questions ranging from across-tissue differential expression to homologous gene expression. Despite their practical value, these databases rely on relative measures of expression and are unable to address the most fundamental question—which genes are expressed in a given cell type. The Gene Expression Barcode is the first database to provide reliable absolute measures of expression for most annotated genes for 131 human and 89 mouse tissue types, including diseased tissue. This is made possible by a novel algorithm that leverages information from the GEO and ArrayExpress public repositories to build statistical models that permit converting data from a single microarray into expressed/unexpressed calls for each gene. For selected platforms, users may upload data and obtain results in a matter of seconds. The raw data, curated annotation, and code used to create our resource are also available at http://rafalab.jhsph.edu/barcode.
The completion of the human genome led to an unprecedented number of biological discoveries including the sequence and location of genes. However, knowledge of our genome provides little information about what distinguishes cell types. In contrast, knowledge of our transcriptome, the genes expressed in our cells, provides insight into what distinguishes cell types. Here we report on a draft of the human and murine transcriptomes based on public microarray data. Studies based on microarray data (1,2) typically report results in terms of relative expression, for example, which genes are differentially expressed in one condition compared to others. Minimal attention has been devoted to determining which genes are expressed in a specific tissue or cell type. Databases such as EBI’s Human Gene Expression Map (3), TiGER (4), BODYMAP (5), BioGPS (6) and TiSGeD (7) provide comprehensive microarray data from thousands of samples but none focus on absolute measures of expression. Furthermore, these databases are useful for gene-by-gene analysis, but have limited functionality for researchers interested in a global view of transcription for specific tissue types.
Knowing which genes are expressed in which tissues will increase our fundamental understanding of cellular processes and provide a starting point for research targeting drug discovery and individualized medicine. The difficulty in obtaining absolute measures of expression stems from the fact that feature characteristics, such as probe sequence, can cloud the relationship between observed intensities and actual expression levels (8). This ‘probe effect’ is large, yet consistent between hybridizations on the same platform, which implies that it can be cancelled out by relative measures of expression (9). However, the probe effect makes it impossible to interpret differences in intensities between genes. This implies that one cannot infer presence of a transcript from the size of reported microarray values (Figure 1A).
To estimate transcriptomes we extended the original gene expression barcode methodology (8), which provides reliable absolute measures of expression but for a limited number of genes: 2519 for human and 5031 for mouse. This algorithm requires genes to show clear separation between low and high expression measurements to classify groups into silenced and expressed. However, most genes do not exhibit this property. To create a comprehensive estimate of cell-type transcriptomes, we extended the original barcode methodology to provide expression calls for all genes on the array. We achieved this by (i) developing a series of negative control experiments; (ii) leveraging information from 13 824, 18 656 and 9652 publically available chips from the Affymetrix Human Genome U133A (HGU133a), U133 Plus 2.0 (HGU133plus2) and Mouse Genome 430 2.0 (Mouse4302) platforms, respectively; and (iii) developing a novel application of the probability of expression (POE) model (10). This resulting approach provided a new version of the barcode algorithm that provided standardized values that for the first time permitted comparison across all genes (Figure 1B). These standardized values can then be converted to silenced and expressed calls by deciding on a single threshold. We refer to this binary version of the expression values as a ‘barcode’. For details on the new statistical approach and comparisons to existing detection algorithms, see the Supplementary Data. The Supplementary Data also describes how the binary barcode version of the data provides protection against the ubiquitous batch effect (11); a fact described by Zilliox and Irizarry (8) and confirmed by the Microarray Quality Control (MAQC) II project (12). Below we include examples of practical uses of our database.
The barcode database provides absolute calls of expression for 13 824, 18 656 and 9652 samples from the HGU133a, HGU133plus2 and Mouse4302 platforms, respectively, obtained from GEO and Array Express. We manually curated sample annotation associated with these arrays and required at least five biological replicates for each cell type. This resulted in 78, 131 and 89 different normal and cancer cell types for HGU133a, HGU133plus2 and Mouse4302, respectively. In each cell type, for each gene we computed the proportion of samples for which that gene was called expressed. These values defined our estimated transcriptome for each cell type. Tables containing these values for each cell type represented in each of the three platforms are available here: http://rafalab.jhsph.edu/barcode/index.php?page=transcriptome.
We verified the biological validity of our transcriptomes by grouping the genes expressed in CD4+ T cells, cerebellum, liver and skeletal muscle by gene ontology. Functional annotation clustering using DAVID (13) showed that the most enriched biological groups were those expected for a given tissue (Table 1). For example, gene groups involved in synaptic transmission were found in the cerebellum, while groups involved in muscle contraction were specific to skeletal muscle. We provide the analysis for these four cell types only as an example and expect researchers with the appropriate expertise to take advantage of our resource and perform more detailed analysis of the specific genes found in each tissue.
The consistency of barcode expression calls within cell-type replicates was remarkable. To illustrate this we focused on normal tissues from HGU133a, in which 90% of the proportions were exactly 0 or 1. Because 51% of the non-consistent calls were due to just 12% of the genes, we quantified consistency with a gene-specific entropy measure. We speculate that these genes are associated with poor performing probe sets. Another possibility is that they are universally varying genes such as those involved in the circadian rhythm. Because the transcriptome values of these genes should be interpreted with care, we flag them in our database. This classification can be useful to users that are interested in using barcode data in prediction applications such as clinical diagnostics.
Our database also provides a global view of transcriptomes, allowing a user to examine global characteristics of gene expression. For example, in our estimate of the human transcriptome, most genes were primarily off, and a small proportion primarily on, across cell types: 76% of genes were off in at least 80% of tissues and 2% of genes were on in at least 80% of tissues. Among the remaining 22% of genes, 55% were high entropy genes.
In our database, we used the estimated transcriptomes to create catalogs, available from http://rafalab.jhsph.edu/barcode/index.php?page=catalog, with gene-specific information. Specifically, for each gene we reported the cell types in which it was expressed, similar to other databases, and the estimated entropy for that gene. Our database can also be used to determine the genes expressed in a given cell type (http://rafalab.jhsph.edu/barcode/index.php?page=tissuegene) and to compare the genes expressed in two or more cell types (http://rafalab.jhsph.edu/barcode/index.php?page=tissuecomp). A unique aspect of our database is that it can be used to identify genes in a particular tissue that have not been well studied. For example, we found that there were 1298 expressed genes found in CD4+ T cells (excluding genes found in ‘many’ tissues). We then narrowed the list down to Affymetrix grade E and R genes, which correspond to expressed sequence tags (ESTs) and have little or no functional annotation. We also excluded high entropy genes. Of the 130 genes that fit these criteria, 12 were only found in CD4+ T cells and not any other tissue in the database, demonstrating a query unique to this catalog.
While our database differs fundamentally from exiting web-tools in that it can produce barcodes from a single microarray and provides absolute measure of expression, three other databases can be used to create a list of genes that are specifically up-regulated for a given cell type; namely EBI’s Human Gene Expression Map, TiGeR and BODYMAP. We assessed the performance of our resource relative to others using sec-gen RNA sequencing counts as a gold standard. Our database provided a substantial improvement in sensitivity over existing databases without sacrificing specificity. In fact our specificity was equal or superior in all comparisons (Table 2). More details are provided in the Supplementary Data.
Here we present a powerful illustration of the added value our resource provides the research community. There have been conflicting reports on the similarity of gene expression profiles between human and mouse orthologous genes. Some have gone so far as to claim that ‘any human tissue is more similar to any other human tissue examined than to its corresponding mouse tissue’ (14). Data produced with two different platforms, human and mouse, designed with different probes, supported this conclusion. This led others to claim that platform-specific technical variability is the cause of this apparent dissimilarity (15,16). Our resource easily clarifies this controversy. We selected a number of normal tissues represented in both our human and mouse databases. Using reported gene expression measurements to cluster human and mouse tissues, we observed a strong species-effect: species clustered together (Figure 2A). However, using barcode data from our database, the species-effect was removed and we observed perfect clustering among the tissues (Figure 2B). Our result supported the more intuitive biological conclusion that, for example, a human kidney is more like a mouse kidney than a human brain. Because barcodes are designed to be robust to probe-effects, these results provide strong evidence that large species-specific differences are in fact driven by systematic biases. These biases are absent in our database, allowing users to compare transcriptomes between species.
Note that while most computational pipelines require data from multiple microarray experiments for normalization purposes, our resource is able to process data from a single array. A sample of unknown origin can be uploaded here: http://rafalab.jhsph.edu/barcode/index.php?page=sample_process, our resource then processes the sample, compares the result to each pre-computed cell type transcriptome, and reports back distances between the sample-specific barcode and tissue-type-specific transcriptome. The barcode methodology is robust to batch effects and can be applied to data from a single chip, which makes it particularly useful for across study cell type prediction (8). Because our database contains normal and cancer cell types, one can upload a tumor of unknown origin and determine the cancer cell type to which it is most similar. The Supplementary Data includes a promising example illustrating the potential of using our resource for cancer diagnosis.
The Gene Expression Barcode is the first resource to reliably call genes silenced or expressed using data from a single microarray. We harness this functionality to estimate the human and mouse transcriptomes and provide results via our database. Specifically, we can create a gene expression barcode for a single microarray that provides information about the expression states of all genes. We have combined thousands of gene expression barcodes to create vast catalogs of transcriptome information spanning hundreds of cell types and tens of thousands of genes. These catalogs are easily accessible via a series of webtools that allow an investigator to readily access gene and/or cell type specific information. Users can also calculate the transcriptomes for their own samples, which is useful for researchers looking at epigenetic markers compared to gene expression or those conducting genome wide association studies (GWAS) where gene expression is a phenotype. These resources will be updated and expanded to other platforms as more data becomes publicly available; we have created tools that automatically do this.
We note that our approach provides only an early draft of the transcriptomes. We can expect various improvements. For example, although we provide the original expression data, the default barcode data currently dichotomizes the data into two classes mainly to protect against batch effects. Future versions will further stratify the expressed strata. We also realize that much of the non-coding RNA expression is yet undefined, and hence much work on the transcriptome remains. We therefore plan to include second generation RNA sequencing data as soon as enough samples become publicly available to implement an adapted version of our algorithm. However, despite these limitations, we have provided various examples of powerful analyses permitted by the estimated transcriptomes. We expect many more applications to flourish and discoveries to be made as more experts become aware and use our tools.
Supplementary Data are available at NAR Online.
National Institutes of Health, partial (GM083084, RR021967 and UL1RR025005 to R.A.I.); National Cancer Institute (CA132480 to M.J.Z.); National Institutes of Health, partial (T32GM074906 to M.N.M.). Funding for open access charge: R01GM083084.
Conflict of interest statement. None declared.
We thank Yongqiang Zhang for providing us with the yeast RNA and Tianwei Yu for help with the sample parsing code. We thank the maintainers of GEO and ArrayExpress for making the data publicly available. We thank Thomas Louis for his feedback on the statistical aspects of the article. M.N.M. conceived the study, developed the software and drafted the article. R.A.I. conceived the study and drafted the article. M.J.Z. conceived the study, revised the article and provided the data. K.U. and H.A.J. developed software.