Accurate normalization of gene-expression levels is an absolute prerequisite for reliable results, especially when the biological significance of subtle gene-expression differences is studied. Still, little attention has been paid to the systematic study of normalization procedures and the impact on the conclusions. For RT-PCR, there is a general consensus on using a single control gene for normalization purposes. A comprehensive literature analysis of expression studies that were published in high-impact journals during 1999 indicated that
GAPD, ACTB, 18S and 28S rRNA were used as single control genes for normalization in more than 90% of cases [
11]. As numerous studies reported that housekeeping gene expression can vary considerably [
6,
9,
10,
11,
12], the validity of the conclusions is highly dependent on the applied control. Some laboratories have tried to find the optimal control gene for their experimental system, and often rRNA molecules were proposed as best references. These studies should be approached with some caution, as often only the variation in expression of the tested genes with respect to the mass loading of total RNA was assessed. As rRNA molecules make up the bulk of total RNA, they should indeed correlate very well with the total RNA mass, but that does not necessarily make them good control genes. As outlined in the introduction, total RNA and rRNA levels are not proper references, because of the observed imbalance between rRNA and mRNA fractions.
In addition to searching for a stable control gene, we aimed at determining the errors related to the common practice of single control normalization. In this study, we provide evidence that a conventional normalization strategy based on a single housekeeping gene leads to erroneous normalization up to 3.0- and 6.4-fold in 25% and 10% of the cases, respectively, with sporadic cases showing error values above 20. This analysis showed that a few control genes were unstable and significantly differentially expressed in some tissue panels, as evidenced by the decrease from 5.9 to 4.5 for the 90th-percentile single control normalization error value for neuroblastoma when the
B2M gene is omitted (data not shown). This finding agrees with the reported differential expression of
B2M in neuroblastoma, corresponding to the stage of differentiation of the tumor cells [
18]. The error-distribution curves not only reflect the stability of expression of the applied controls, but also the sample heterogeneity within a tissue panel, as noted from the less steep curve for the heterogeneous set of normal pooled tissues compared to the other, relatively homogeneous, tissue panels. In this regard, the issue has been raised that finding proper control genes is even more important when working with tissues of different histological origin [
9].
The single control normalization error values point to inherent noisy oscillations in expression levels of the control genes, a finding which has been corroborated in other large-scale studies where several thousand genes were measured in different cells or tissues by microarray analysis. No gene was found on an 8,000-feature array that did not vary by ratios of at least twofold across a panel of 60 cell lines [
14], and a set of genes frequently used for normalization (including
GAPD and
ACTB) was found to vary in expression by 7- to 23-fold [
9]. Taken together, our data and these studies clearly show that ideal and universal control genes do not exist. This warrants the search for stably expressed genes in each experimental system, and for the development of an accurate normalization strategy.
To validate the expression stability of the tested control genes without any prior assumption of a metric for standardization, we had initially measured the correlation between the raw, non-normalized expression levels of any two control genes, which should be nearly perfect for proper control genes. We observed, however, that the data range between the minimum and maximum expression levels, or any outlying value, could have a profound influence on the slope of the regression line, and consequently on the value of the correlation coefficient. This made Pearson and Spearman correlation coefficients unsuitable for this kind of analysis. We have therefore developed a new stability measure, based on the principle that the expression ratio of two proper control genes should be identical in all samples, regardless of the experimental condition or cell type, with increasing ratio variation corresponding to decreasing expression stability of one (or both) of the tested genes. The proposed standard deviation of log-transformed control gene ratios is a robust measure for the variation between two control genes, as it does not impose any requirements for normality or homoscedasticity of the data points. Furthermore, this measure is independent of the abundance difference between the genes, and is equally affected by any outlying or extreme ratio (that is, outliers for a sample with low or high overall expression, or outliers caused by an upregulated or downregulated gene have an equivalent increase in pairwise variation V). Logarithmic transformation of the ratios is required for symmetrical distribution of the data around zero, resulting in equal absolute values (but opposite signs) for a given ratio and the inverse ratio. As a result, the standard deviation of log-transformed ratios is identical to the standard deviation of log-transformed inverse ratios, which makes this measure characteristic for every combination of two genes.
Having established a robust measure to assess the variation in expression of two control genes, we subsequently defined a gene-stability measure M as the average pairwise variation between a particular gene and all other control genes. Using a VBA applet geNorm developed in-house, we ranked ten commonly used housekeeping genes belonging to different functional and abundance classes according to their expression stability in five tested tissue panels. The clear decrease of M of the remaining control genes during stepwise exclusion of the worst-scoring gene points at differences in the stability of gene-specific expression and demonstrates that the remaining genes are more stably expressed than the excluded genes. Some tissue panels show a relatively steep initial decline, which reflects the exclusion of one or more aberrantly expressed control genes (for example, ACTB and HMBS for leukocytes), as also noticed from the single control normalization error analyses (see above). The average gene stability values of the remaining genes during stepwise elimination of the least stable control genes also indicates tissue-specific differences, with bone marrow and the pool of normal tissues having the lowest and highest overall expression variation, respectively. The latter is no surprise, given the larger tissue heterogeneity in this panel. The question of whether the observed high variation for neuroblastoma is a cancer-related phenomenon of deregulated expression is currently under further investigation. From these analyses, it is clear that there is no universal control gene suitable for all cell types. ACTB and B2M appear to be the worst-scoring genes, whereas UBC, GAPD and HPRT1 seem to be the best overall control genes, each belonging to the four most stable genes in four out of five tested tissues. However, these generalizations should be treated with caution. B2M appears to be one of the least stable control genes, but is nevertheless a good choice for normalization of leukocyte expression levels. This clearly demonstrates that a proper choice of housekeeping genes is highly dependent on the tissues or cells under investigation. This is even more important when considering the differences in transcript abundance of some control genes between different tissues. The large expression differences between the tissues tested for B2M and ACTB, for instance, would definitely result in large normalization errors if they were used for standardization. Interestingly, the observed tissue-specific expression of these control genes is in keeping with their known role or function: there is high B2M expression in leukocytes, where it is a major cell-surface marker, and relatively low non-muscle cytoskeletal ACTB expression in heart tissue, which is predominantly of muscular origin.
In view of the inherent variation in expression of housekeeping genes, we recommend the use of at least three proper control genes for calculating a normalization factor, and present a procedure to determine whether or not more - and if so, how many - control genes were required for reliable normalization. This analysis clearly showed that three stable control genes sufficed for accurate normalization of samples with relatively low expression variation, whereas other tissue panels required a fourth, or even a fifth control gene to capture the observed variation.
The purpose of normalization is to remove the sampling differences (such as RNA quantity and quality) in order to identify real gene-specific variation. For proper internal control genes, this variation should be minimal or none. To validate the gene-stability measure M and the geNorm algorithm to identify the most stable control genes in a set of samples, we have calculated the gene-specific variation for each gene as the coefficient of variation of normalized expression levels. To this end, the raw expression values were standardized to different normalization factors, calculated as the geomean of the most, intermediate, or least stable control genes (as determined by geNorm). The rationale of this analysis is that a normalization factor based on proper internal control genes should remove all nonspecific variation. In contrast, unstable control genes cannot completely remove the nonspecific variation, and even add more variation, resulting in larger so-called gene-specific variations for the tested control genes. This analysis clearly demonstrated that most nonspecific variation was removed when the most stable control genes (as determined by geNorm) were used for normalization, which proves that the novel stability measure and strategy presented here effectively allowed the stability of gene expression in the different tissue panels to be assessed.
Further validation demonstrated that the geometric mean of carefully selected control genes is an accurate estimate of the mRNA transcript fraction, as determined by comparison with frequently applied microarray normalization factors. Although both RT-PCR normalization factors based on geometric averaging are relatively similar, the one based on at least seven control genes (that is, NF
M < 0.7) is slightly more equivalent to the microarray-scaling factors. Two possible explanations can account for this observation. First, the five most stable control genes as determined by geNorm are based on only two RNA samples (that is, a Cy3-labeled reference pool, and a Cy5-labeled test sample), in contrast to the RT-PCR data, where 9 to 34 samples were used, resulting in more reliable estimation of the expression stability. Second, recent technical reports clearly state that array hybridization analyses experience considerable - often underestimated - variation and uncertainty at several levels. Accurate background fluorescence correction and spot quality assessment, among others, have been described as critical issues for reliable ratio estimation [
19,
20,
21]. The higher variability associated with array hybridization results might thus explain the need for more control genes to normalize the data. Nevertheless, this study clearly showed that normalization based on the geometric mean of carefully selected control genes results in equivalent ratio estimation compared to commonly applied array scale factors, which validates its use for RT-PCR normalization. In addition, the method presented could easily be applied to normalize gene-expression levels resulting from microarray hybridization experiments, where only a limited number of genes are spotted, including some housekeeping genes.
In conclusion, we described and validated a procedure to identify the most stable control genes in a given set of tissue samples, and to determine the optimal number of genes required for reliable normalization of RT-PCR data. The strategy presented can be applied to any number or kind of genes or tissues, and should allow more accurate gene-expression profiling. This is of utmost importance for studying the biological significance of subtle expression differences, and for confirmatory and/or extended analyses of microarray results by means of RT-PCR.