To assess the fundamentally important correlation between levels of RNA species and corresponding proteins accurately, reliable estimates of their abundance are clearly required. Ideally, quantitative methods that yield highly accurate, absolute estimates of their levels would be applied, and currently the method of choice for quantitative proteomics is mass spectrometry, following several purification and separation steps. This approach can provide high levels of accuracy, sensitivity and specificity, but as yet it is not suitable for large-scale analyses. Alternatively, as in this study, relative levels of proteins across samples of multiple cell lines in tissue microarrays can be determined immunohistochemically, minimizing inter-experimental variation by simultaneously staining samples of all of the lines with each antibody. In addition, various types of microarrays have been developed recently that, in conjunction with various statistical models, are capable of providing reliable estimates of absolute quantities of specific mRNA species in samples from spot intensities [17].
Thus, the use of relative techniques such as two-color microarrays and immunohistochemistry allows the levels of large numbers of gene products to be compared in multiple samples. However, it should be recognized that cell lines are model systems that differ in various respects from cells in the organisms from which they are derived; notably, many of the regulatory pathways are not present and the chromosomal arrangements deviate from the normal patterns in healthy tissues [18]. Findings regarding correlations between RNA and corresponding protein levels in cell lines should therefore be interpreted with some caution. Furthermore, since the abundance of RNA and protein is analyzed in samples of cell lines containing multiple cells, the values used in subsequent correlation analysis are based on averages for the cell line populations, which may be in varying stages of the cell cycle.
Bearing in mind the above provisos, the distributions of the correlation coefficients obtained in both the cDNA and oligo microarray data comparisons with the protein dataset are approximately normally distributed, although the density function of the distribution shows a tendency towards a minor peak around a mean value of 0.65–0.75, implying that the gene products may be divided into two major groups that have different degrees of correlation. Further, the minor peak is enhanced when the correlations are based on Pearson correlation coefficients. Shankavaram et al. noticed a similar pattern in their study of NCI-60 mammalian cell lines [9]. In contrast, the distribution of cDNA versus oligo microarray correlations had more of a beta shape, indicating that data generated from many pairs of corresponding probes in the two array systems were strongly correlated, but some pairs yielded results that correlated poorly, which decreased the mean correlation coefficient. This may have been due to poor sequence overlap, i.e. the probes yielding poor correlations may have hybridized to different parts of transcripts that mapped to the same genes according to data in the Ensembl gene database. The degree of correlation between the cDNA and oligo microarray datasets is consistent with the degrees found in previous analyses [19], but further evaluation of variations between the results of this and previous studies in this respect is beyond the scope of this article. The oligo microarray assay yielded higher correlation coefficients with the protein data than the cDNA microarray assay, probably because the oligo probes had higher specificity, in accordance with expectations due to the lower degree of cross-hybridization that generally occurs when shorter probes are used.
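As an illustration of how per-gene correlation coefficients of the two kinds mentioned above (Pearson and rank-based) could be computed across cell lines, the following is a minimal sketch in Python. It is not the analysis code used in this study: the function names, gene labels and all values are hypothetical, and the rank-based coefficient is implemented as a Pearson correlation of average ranks (Spearman's coefficient), which is one common way to handle the tied scores that ordinal staining data tend to produce.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    denom = sqrt(var_x * var_y)
    # A constant profile has no defined correlation; report NaN instead of failing.
    return cov / denom if denom > 0 else float("nan")


def ranks(values):
    """1-based ranks; tied values receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(ranks(x), ranks(y))


def per_gene_correlations(rna, protein):
    """Correlate RNA and protein profiles for every gene present in both inputs.

    Both arguments map a gene label to a list of values, one per cell line,
    with the cell lines in the same order in both dictionaries.
    """
    return {
        gene: {
            "pearson": pearson(rna[gene], protein[gene]),
            "spearman": spearman(rna[gene], protein[gene]),
        }
        for gene in rna
        if gene in protein
    }


# Hypothetical toy data: two genes measured in five cell lines.
rna = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "GENE_B": [1.0, 2.0, 3.0, 4.0, 5.0],
}
protein = {
    "GENE_A": [0, 1, 1, 2, 3],  # ordinal staining scores with a tie
    "GENE_B": [3, 0, 2, 1, 2],
}

for gene, coefficients in per_gene_correlations(rna, protein).items():
    print(gene, coefficients)
```

Collecting the resulting coefficients for all genes gives the kind of distribution discussed above; whether a second peak appears would then depend on the data and on which coefficient is used.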
The major and minor peaks in the histograms of the correlation coefficients between the oligo microarray and protein profiles may correspond to two groups of genes that are regulated by different mechanisms. The genes with high correlations may be regulated solely, or almost solely, at the transcriptional level, in accordance with evidence from the ontological analysis that high proportions of these genes are involved in cellular processes and maintenance, for which there is likely to be little need for complex regulation. In contrast, the weak correlations of the other sets of genes may be due to the effects of complex regulatory mechanisms and/or noise generated in the assays masking subtle changes in mRNA transcripts and protein levels, thereby weakening the correlations.
The weak correlations for gene products with correlation coefficients lower than 0.445 probably have several causes, including various post-transcriptional processes that complicate attempts to obtain accurate estimates of quantities of corresponding mRNAs across the cell lines that are destined for translation. For instance, some mRNAs are strongly retained in the nucleus, which may lead to their levels being over-estimated relative to protein levels. Technical noise generated by the respective platforms (notably due to cross-hybridization in the DNA microarray analyses and variations in the affinity and specificity of the antibodies used in the immunoassays) may also weaken the correlations, and thus increase the proportions of genes with correlation coefficients lower than 0.445. The reason that no correlation was found for certain genes is probably related to the complexity of their regulatory mechanisms, which may weaken their correlations to levels that are not detectable with current techniques, while genes with strong correlations may be regulated solely at the transcriptional level.
The concordance between estimates of RNA levels obtained from the array analyses and the RT-PCR analyses was found to be positively correlated with the correlations found between the RNA and protein levels, but the quantities of transcripts estimated by the RT-PCR assay were not similarly related to the RNA-protein correlations. The number of samples is too small to draw definitive conclusions, but these results suggest that if the accuracy of RNA estimations is increased (based on the correlation with the RT-PCR assay), the correlation between RNA and protein levels is more likely to be high (Additional file 9). In addition, the analysis of RNA levels estimated by RT-PCR showed that the mean correlation is higher than for the array-based platforms, although the number of samples in this analysis is also small. This implies that more accurate estimates of the RNA levels are likely to increase the overall correlation, and that low correlations are mainly caused by variable accuracy of the RNA estimates rather than of the protein estimates.
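The argument that more accurate RNA estimates should raise the observed RNA-protein correlation can be illustrated with the classical attenuation relation from measurement theory. It is offered only as a sketch, under the assumption (not tested in this study) that the measurement errors of the two platforms are additive and independent of each other and of the true levels. The symbols are introduced here for illustration: r_obs is the observed Pearson correlation, r_true is the correlation between the error-free RNA and protein levels, and each ρ is a reliability, i.e. the fraction of the observed variance on that platform that reflects true variation.

```latex
r_{\mathrm{obs}} \;=\; r_{\mathrm{true}}\,
  \sqrt{\rho_{\mathrm{RNA}}\;\rho_{\mathrm{protein}}},
\qquad 0 < \rho_{\mathrm{RNA}},\ \rho_{\mathrm{protein}} \le 1
```

Under these assumptions, raising the reliability of the RNA estimates towards 1 moves the observed coefficient towards the value set by the biology and by the protein assay alone, which is in line with the tendency described above.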
We have shown here that the correlation coefficients between RNA and protein profiles for 1066 gene products across 23 cell lines vary widely. The mean correlation coefficient is ~0.3, but the groups of genes represented by a major peak at mean value ~0.3 and a minor peak at mean value 0.65–0.75 have significantly different mean values, which may reflect differences in their regulatory mechanisms. Utilizing RNA data from two independent microarray formats, and immunohistochemical data obtained using antibodies applied in the Human Protein Atlas initiative, we found significant correlations between the RNA and protein profiles of 33% of the gene products. Although transcriptional profiling cannot be considered a high-throughput approach for the validation of affinity reagents, when correlation measurements between RNA and protein levels are available they provide additional information regarding the performance of employed antibodies. Further, when the RNA estimates are highly accurate the correlation between RNA and protein levels has a tendency to increase. However, while high correlation values might support antibody specificities, observed discrepancies between RNA and protein levels do not necessarily imply that the antibodies perform poorly, since they could be due to various biological factors, such as complex gene regulatory mechanisms.