The aim of this study was to compare the transcriptome of human U-2 OS cells with the presence of the corresponding proteome. The transcriptome was surveyed close to saturation by massive SOLiD DNA sequencing [Additional file 1: Supplemental figure S1]. In total, approximately 15 million high-quality 35-bp reads were obtained and mapped onto the human reference genome (hg18), and quantitative measures were computed on a per-gene basis. Analysis of the transcription pattern demonstrated that the majority of all Ensembl genes (73.4%; 15536/21146 genes) were expressed in U-2 OS, i.e., represented by at least one uniquely mapped read. The frequency distribution is presented in the additional information [Additional file 1: Supplemental figure S2].
To create a comparative protein expression set, a non-redundant collection of antibodies and genes was assembled from the Human Protein Atlas [12]. In the initial collection of data, a high degree of protein presence was observed for both IHC and IF, with expressed proteins detected for 88.7% and 73.6% of all genes analyzed, respectively (Table ). In the following analysis, all antibodies with protein expression data from both IHC and IF in the U-2 OS cell line were used. For genes with more than one antibody directed towards the gene product, the best-scoring IF antibody was selected according to a standard validation scheme [Additional file 1: Supplemental table S1]. The assembled non-redundant set of antibodies was then used to collect the corresponding immunohistochemistry and immunofluorescence information from U-2 OS, yielding the HPA subset. The HPA subset consists of 2749 Ensembl genes (with 2749 corresponding antibodies) that all have protein presence/absence information from both IHC and IF experiments in the U-2 OS cell line. Figure shows the obtained data for gene NDUFS4 as an example of the input data for the three included platforms, IHC (A), IF (B) and RNA-seq (C).
Figure 1 Overview of the data types used. (A) Images acquired from IHC were automatically processed and annotated. (B) Images from IF were manually annotated with staining intensity and a validation score. (C) For RNA-sequencing, reads mapping uniquely to exons ...
Protein distribution and overlap with transcriptional data
Using the transcriptome sequencing strategy, we detect transcripts for 85.3% ((2123+222)/2749) of the genes in the HPA subset (Table ). A large overlap in expression is evident when comparing RNA-seq with the immunological assays. Figure compares the presence of proteins and transcripts and demonstrates that, within the HPA subset, RNA-seq detects 87.1% (2123/(2123+315)) of the IHC-detected proteins and 87.2% (1771/(1771+260)) of the IF-detected proteins. These proportions are higher than expected by chance; a chi-square test yields p-values of 3.4 × 10⁻¹³ and 2.6 × 10⁻⁶ for IHC and IF, respectively. This supports a strong association between RNA and protein expression. The fact that approximately 13% of all detected proteins do not have a detectable transcript can indicate several different phenomena: (i) some genes are expressed at very low levels as transcripts but are efficiently translated into stable protein products, or (ii) some antibodies are cross-reactive, yielding false-positive protein detection.
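The overlap percentages and the association test described above can be approximated from the counts of a 2×2 presence/absence table. The following is a minimal sketch, not the authors' procedure. It assumes a Pearson chi-square test with Yates' continuity correction (the text does not state which variant was used), derives the "absent in both" cell from the 2749-gene total, and uses hypothetical function and variable names. The resulting p-values need not match the reported ones exactly.

```python
import math


def chi_square_2x2(both, protein_only, rna_only, neither, yates=True):
    """Chi-square test of independence for a 2x2 presence/absence table.

    Returns (statistic, p_value) for one degree of freedom.
    """
    n = both + protein_only + rna_only + neither
    diff = abs(both * neither - protein_only * rna_only)
    if yates:
        # Yates' continuity correction (an assumption, not stated in the text).
        diff = max(0.0, diff - n / 2)
    margins = (
        (both + protein_only) * (rna_only + neither)
        * (both + rna_only) * (protein_only + neither)
    )
    stat = n * diff ** 2 / margins
    # With 1 degree of freedom the chi-square tail equals erfc(sqrt(stat / 2)).
    return stat, math.erfc(math.sqrt(stat / 2))


TOTAL_GENES = 2749  # size of the HPA subset

# Counts quoted in the text: (protein+ RNA+, protein+ RNA-, protein- RNA+).
counts = {"IHC": (2123, 315, 222), "IF": (1771, 260, 574)}

results = {}
for platform, (both, protein_only, rna_only) in counts.items():
    # Derived cell: genes detected by neither this assay nor RNA-seq.
    neither = TOTAL_GENES - both - protein_only - rna_only
    # Share of detected proteins that also have a detected transcript.
    with_transcript = both / (both + protein_only)
    stat, p = chi_square_2x2(both, protein_only, rna_only, neither)
    results[platform] = (with_transcript, stat, p)
    print(f"{platform}: {with_transcript:.1%} with transcript, "
          f"chi2 = {stat:.1f}, p = {p:.1e}")
```

Writing the tail probability with `math.erfc` keeps the sketch free of third-party dependencies; a statistics library would normally be used for tables with more than one degree of freedom.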
Figure 2 (A) Overlap of IHC data with RNA data. In IHC, 88.7% of the investigated genes are present. (B) As in (A), but overlap between IF and RNA-sequencing. In IF, 73.9% of the investigated genes are called present. The 'present overlaps' between IHC and RNA-sequencing ...
Interestingly, 9.4% (222/(2123+222)) of the genes in the HPA subset that were detected on the transcript level are not detected on the protein level by IHC. For IF, this number is 24.4% (574/(1771+574)). It is not clear whether this is caused by a subset of genes that are transcribed but not translated, or by limited sensitivity in the protein measurements. The fact that this fraction is higher for IF than for IHC indicates that unspecific antibody-protein interaction in IHC, in combination with a group of transcripts that do not undergo translation, is the major contributor to this effect, since the sensitivity of IF is generally higher than that of IHC (see below).
Comparison over three technology platforms
For a more in-depth analysis, we investigated the expressed genes in a combined analysis of IHC, IF and RNA sequencing of the HPA subset (2749 genes). We show that 60.1% (1651 genes) of all investigated genes are detected by all platforms (Figure ) and only 1.2% (34 genes) were not detected by any platform. If only one of the two protein detection platforms is required to call presence on the protein level (in the case of one of them producing a false positive call), only 3.2% - 5.2% of all genes are not detected on the transcript or RNA level. Interestingly, 71% ((1651+205)/(110+472+1651+205+55+120)) of proteins detected by either IHC or IF were detected by both methods. In total, IHC detects more proteins than IF (2438 vs. 2031). The higher number of genes detected by IHC likely indicates a higher degree of false positives, since genes detected by IF and RNA-seq are expressed at lower levels than genes detected by IHC and RNA-seq (Figure , see below).
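The figures quoted in this paragraph can be cross-checked from the per-region gene counts of the three-platform comparison. The sketch below is illustrative only. The region counts are the ones appearing in the text (A = IHC, B = IF, C = RNA-seq), the RNA-seq-only region is not quoted and is derived here as the remainder of the 2749 genes, and the helper name `detected_by` is hypothetical.

```python
TOTAL_GENES = 2749  # size of the HPA subset

# Genes per region; the key lists the platforms that call the gene present
# (A = IHC, B = IF, C = RNA-seq). The empty key means "absent in all".
regions = {"A": 110, "B": 55, "AB": 205, "AC": 472, "BC": 120, "ABC": 1651, "": 34}

# RNA-seq-only region: not quoted in the text, derived as the remainder.
regions["C"] = TOTAL_GENES - sum(regions.values())


def detected_by(platform):
    """Number of genes called present by one platform ('A', 'B' or 'C')."""
    return sum(n for key, n in regions.items() if platform in key)


# Proteins detected by IHC or IF, and by both assays.
either_protein = sum(n for key, n in regions.items() if "A" in key or "B" in key)
both_protein = sum(n for key, n in regions.items() if "A" in key and "B" in key)

print(f"IHC: {detected_by('A')}, IF: {detected_by('B')}, RNA-seq: {detected_by('C')}")
print(f"All platforms: {regions['ABC'] / TOTAL_GENES:.1%}")
print(f"Absent everywhere: {regions[''] / TOTAL_GENES:.1%}")
print(f"IHC and IF agree on {both_protein / either_protein:.0%} of detected proteins")
```

Keying each region by the platforms that flag it present makes the per-platform totals a simple membership test, so the same dictionary serves every percentage in the paragraph.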
Figure 3 (A) Venn diagram of presence flags for the three platforms (A = IHC, B = IF, C = RNA-seq). 60.1% of all genes investigated are present in all platforms. Only 34 genes (1.2%) are absent in all platforms. (B) Cumulative transcription density curves for ...
Since RNA-seq provides quantitative measures of gene expression levels, we investigated transcript levels for the genes in the subgroups defined in Figure , where this was possible (Figure ). This showed that the groups 'C' (detected by RNA-seq only) and 'BC' (detected by IF and RNA-seq) were expressed at significantly lower levels than all genes combined (Kolmogorov-Smirnov (KS) test, p = 3.8 × 10⁻⁴ and p = 4 × 10⁻⁴, respectively). This indicates that IF is more sensitive than IHC in detecting genes with low transcript levels.
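A two-sample KS test compares the empirical cumulative distributions of two sets of expression values. The sketch below shows the general technique under stated assumptions: it is two-sided, uses the asymptotic Kolmogorov series for the p-value (approximate for small samples), and runs on made-up expression values. The text does not say whether the reported test was one- or two-sided, and the function name `ks_two_sample` is hypothetical.

```python
import bisect
import math


def ks_two_sample(sample_x, sample_y):
    """Two-sided two-sample KS test; returns (D, approximate p-value)."""
    xs, ys = sorted(sample_x), sorted(sample_y)
    n, m = len(xs), len(ys)
    # D is the largest vertical gap between the two empirical CDFs,
    # which is always attained at one of the observed values.
    d = max(
        abs(bisect.bisect_right(xs, v) / n - bisect.bisect_right(ys, v) / m)
        for v in xs + ys
    )
    lam = d * math.sqrt(n * m / (n + m))
    if lam < 0.2:
        # The series below converges slowly here and the p-value is ~1 anyway.
        return d, 1.0
    # Asymptotic tail: Q(lam) = 2 * sum_{k>=1} (-1)^(k-1) * exp(-2 k^2 lam^2).
    p = 2.0 * sum(
        (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
        for k in range(1, 101)
    )
    return d, min(1.0, max(0.0, p))


# Made-up expression values, for illustration only.
all_genes = [0.5, 1.2, 2.0, 2.8, 3.5, 4.1, 4.9, 5.6, 6.3, 7.0]
low_group = [0.3, 0.6, 0.9, 1.1, 1.5]

d_stat, p_value = ks_two_sample(low_group, all_genes)
print(f"D = {d_stat:.2f}, p = {p_value:.3g}")
```

The KS-bootstrap variant mentioned elsewhere in the text is not implemented here.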
Next, we investigated the overlap between RNA and protein expression using the quantitative RPKM [10] values as a measure of transcript abundance. This measure is calculated for each gene by counting all reads that map to the exons of the gene and dividing by the length of the gene and the total number of reads. The HPA subset was binned in 25 transcriptional levels, ranging from the top 5% to the bottom 5% of expressed genes, as well as in two transcriptional levels: the upper 50% and lower 50% of expressed genes. Table shows a 95% overlap between the upper 50% of genes and protein presence based on IHC expression. For the lower 50%, this number drops to 82%. For IF, the overlap is 80% and 68% for the upper and lower intervals, respectively. Interestingly, for the smaller bins, this effect is very similar: the overlap is high (98% for IHC, 86% for IF) for the top 5% and remains relatively similar across the top 50% [Additional file 1: Supplemental figure S3]. Furthermore, we chose the subset of antibodies that had the highest validation score (supportive staining) in Western blot [Additional file 1: Supplemental table S1], and as expected, this yielded a slightly higher degree of overlap for both IHC and IF (Table ), suggesting that some of the antibodies with a low validation score might be false positives.
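The RPKM calculation and the upper/lower 50% comparison can be sketched as follows. This is an illustration, not the authors' pipeline. RPKM is taken in its usual form (reads per kilobase of exon per million mapped reads), and the gene names, read counts, exon lengths, protein calls and the total of 15 million reads are placeholder values. The function names `rpkm` and `overlap_by_half` are hypothetical.

```python
def rpkm(exon_reads, exon_length_bp, total_mapped_reads):
    """Reads per kilobase of exon per million mapped reads."""
    return exon_reads * 1e9 / (exon_length_bp * total_mapped_reads)


def overlap_by_half(rpkm_by_gene, protein_present):
    """Fraction of genes with detected protein in the upper and lower 50% by RPKM.

    Expects at least two genes so that neither half is empty.
    """
    ranked = sorted(rpkm_by_gene, key=rpkm_by_gene.get, reverse=True)
    mid = len(ranked) // 2
    halves = {"upper 50%": ranked[:mid], "lower 50%": ranked[mid:]}
    return {
        label: sum(protein_present[gene] for gene in genes) / len(genes)
        for label, genes in halves.items()
    }


# Placeholder data: gene -> (reads mapped to exons, total exon length in bp).
toy_counts = {
    "geneA": (300, 2000),
    "geneB": (40, 1000),
    "geneC": (900, 3000),
    "geneD": (5, 2500),
}
TOTAL_READS = 15_000_000  # placeholder for the number of mapped reads

toy_rpkm = {
    gene: rpkm(reads, length, TOTAL_READS)
    for gene, (reads, length) in toy_counts.items()
}
# Placeholder protein presence calls (e.g. from IHC or IF).
toy_protein = {"geneA": True, "geneB": True, "geneC": True, "geneD": False}

toy_overlap = overlap_by_half(toy_rpkm, toy_protein)
print(toy_overlap)
```

The same ranking could be cut into 5% slices instead of halves to obtain the finer bins described above.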
Overlap between RNA and protein based on RNA expression bins.
Gene Set Enrichment Analysis
DAVID is a tool that performs gene set enrichment analysis for several different categories (GO, KEGG, protein domains, etc.) and has the option to group similar categories into functional groups. When the HPA subset (2749 genes) was analyzed with DAVID for enrichment of gene categories against a background of all protein-coding genes, some Gene Ontology categories emerged as over-represented. These include development, apoptosis, proteins related to direct protein sequencing, the cytoplasm and protein binding (data not shown).
Thus, since the HPA subset is somewhat biased from a gene category perspective, further gene set analysis of the subgroups was done with the HPA subset as background. We noticed that for genes in the ABC group (detected by all platforms, Figure ), certain themes were enriched. Two category sub-clusters were significantly over-represented: intracellular proteins and nuclear proteins (Table ). From a technical perspective, these proteins are located within the cells (or even within the nucleus) and are therefore equally well detected using either IHC or IF. In contrast, in the BC group (IHC-, IF+, RNA+) we find that extracellular proteins, a group usually not detected by IF, are significantly enriched. Given the higher sensitivity of IF, we might speculate that these are proteins destined for export that still reside within the cells and thus are present at very low levels.
DAVID enrichment for certain categories
In the AB group (IHC+, IF+, RNA-), we notice that proteins related to glycosylation are significantly enriched. The apparent lack of correlation between RNA and protein expression for this GO category is not fully understood and requires further analysis to elucidate (Figure , see below).
Figure 4 (A) Western blot data for the groups defined in figure 3A. The groups A and AB generally contain a larger fraction of low-scoring antibodies. (B) IF Reliability scores for the same subgroups. The fraction of antibodies with a supportive staining in the ...
Western blot and IF validation score analysis
The Western blots performed within the HPA program are manually investigated and assigned a validation score based on the number of detected bands, approximate size of the bands, etc. [Additional file 1: Supplemental table S1] using a standardized protein lysate panel. We analyzed these scores for all groups defined in Figure . We observe that the A group (positive only in IHC) and the AB group (IHC+, IF+, RNA-) generally contain more low-scoring antibodies (KS-test, p = 1.6 × 10⁻⁴ and KS-bootstrap test, p = 2 × 10⁻³) (Figure ). This suggests that some of these antibodies stain the cells in an unspecific manner (false positives), and the RNA data can thus provide guidance for the validation of the corresponding antibodies. Western blot data were not available for the particular U-2 OS cell line analysed in this study, so a direct expression comparison between WB and other methods was not possible.
For IF images in the HPA program, a validation score is added after manual investigation. Comparative analysis of these scores could only be done for groups with staining according to IF, since a validation score of 7 is used to define absence. In the ABC group (present in all platforms), the fraction of antibodies receiving a supportive score [Additional file 1: Supplemental table S1] is about three times higher than that in the B group (present only in IF) and is confirmed significant (KS-bootstrap test, 2 × 10⁻³