Gene expression profiling, including quantitative RT-PCR (qPCR) and microarray experimentation, is invaluable for the molecular analysis of biological systems. The interpretation of results from such experiments (i.e., the determination of differential expression for a particular gene among datasets) is strongly influenced by the selection of reference genes for normalization across datasets [1]. Specifically, gene expression is normalized within a given dataset by calculating the transcript abundance of the gene of interest relative to that of a gene that is constantly expressed across independent datasets (termed a "housekeeping" or "reference" gene). Differential expression between two datasets or samples is then determined by calculating the ratio of the normalized expression levels for the gene of interest between the two datasets. Typically, housekeeping genes satisfy the following criteria: they are highly expressed in the cell, their expression varies minimally between samples, and their expression is not influenced by the experimental conditions tested [2]. Hence, problems arise when the selected housekeeping genes do not meet these criteria, as fluctuations in their expression may lead to erroneous interpretation of the data.
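The normalization arithmetic described above can be sketched in a few lines of Python. This is a minimal illustration, not a published pipeline: the function names and the abundance values are hypothetical, and abundances are assumed to be on a linear (not log) scale.

```python
def normalized_expression(target_abundance, reference_abundance):
    """Abundance of the gene of interest relative to the reference gene."""
    if reference_abundance <= 0:
        raise ValueError("reference abundance must be positive")
    return target_abundance / reference_abundance


def differential_expression(target_a, reference_a, target_b, reference_b):
    """Ratio of normalized expression between dataset A and dataset B."""
    norm_a = normalized_expression(target_a, reference_a)
    norm_b = normalized_expression(target_b, reference_b)
    return norm_a / norm_b


# Hypothetical abundances in arbitrary linear units, for illustration only.
# Case 1: the reference gene is constant across both datasets.
stable = differential_expression(200.0, 100.0, 100.0, 100.0)
# Case 2: the reference gene itself doubles in dataset A.
drifting = differential_expression(200.0, 200.0, 100.0, 100.0)

print(stable, drifting)
```

In the second case the two-fold change in the gene of interest is cancelled by the matching change in the reference gene, which is the kind of erroneous interpretation an unstable housekeeping gene can cause.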
Historically, beta actin (ACTB), glyceraldehyde-3-phosphate dehydrogenase (GAPDH), and 18S rRNA have been routinely used as reference genes for qPCR and microarray data normalization. However, a number of studies have shown that expression of these genes varies considerably depending on the specific tissue type and disease state of the tissue [3]. Attempts to achieve more reliable normalization include the spiking of synthetic poly-A RNAs for the analysis of cDNA arrays and northern blots, and the combined use of an oligo-(dT)n primer with an 18S-specific primer for qPCR analysis [17]. Other approaches to this problem include re-mining large microarray datasets to identify novel, highly stable genes, and normalizing to a combination of reference genes instead of a single gene [11].
Recently, efforts have been made to identify more suitable reference genes for microarray and qPCR studies of lung cancer. Specifically, candidate reference genes have been identified by mining microarray gene expression data for the least variable genes, followed by validation of expression using qPCR [11]. However, as microarray data do not provide absolute abundance values for transcripts, selection of reference genes from this type of data is inherently problematic. To circumvent this limitation of microarray data, we turn to the large-scale expression profiling permitted by serial analysis of gene expression (SAGE) to identify novel reference genes optimal for the study of lung cancer. This approach, which we have termed normalization of expression by permutation of SAGE (NEPS), takes advantage of the fact that SAGE is a transcriptome profiling technique that identifies the absolute abundance levels of transcripts by direct enumeration of sequence tag counts, thus allowing the direct comparison of expression levels across multiple profiles without the need for reference or housekeeping genes [22].
NEPS adopts a permutation test approach designed for analyzing relatively small sample sizes, such as those typically encountered with SAGE. Unlike the conventional t-test, the permutation test is non-parametric [23]. The null hypothesis states that the mean gene expression levels in the two groups of SAGE libraries being compared (in this case, normal and cancer) are the same. For this analysis, samples from the normal and cancer groups are pooled and then randomly sampled to create a simulated Group 1 and a simulated Group 2. For each gene, the difference in expression between these two simulated groups is measured. This exercise is repeated 10,000 times, generating a simulated mean μ and a simulated standard deviation σ of the difference. The permutation score (PS) of a given gene is defined by PS = (O − μ)/σ, where O is the true difference between the average expression levels in the two groups. Hence, for a given gene, the closer the permutation score is to zero, the more it satisfies the constancy requirement.
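The permutation procedure above can be sketched as follows. This is an illustrative reading of the description rather than the NEPS implementation itself. Three things are assumptions: the score is taken to be the signed quantity (O − μ)/σ, the random sampling is implemented by shuffling the pooled samples and splitting them at the original group sizes, and the tag counts are hypothetical values assumed to be already scaled to a common library size.

```python
import random
import statistics


def permutation_score(group1, group2, n_permutations=10_000, seed=0):
    """Score one gene from its expression values in two groups of libraries.

    Assumed form: (O - mu) / sigma, where O is the observed difference in
    group means and mu, sigma summarize the simulated differences.
    """
    rng = random.Random(seed)
    observed = statistics.fmean(group1) - statistics.fmean(group2)

    pooled = list(group1) + list(group2)
    n1 = len(group1)

    simulated = []
    for _ in range(n_permutations):
        # Reassign the pooled samples at random to two simulated groups
        # that keep the original group sizes (an assumption of this sketch).
        rng.shuffle(pooled)
        diff = statistics.fmean(pooled[:n1]) - statistics.fmean(pooled[n1:])
        simulated.append(diff)

    mu = statistics.fmean(simulated)
    sigma = statistics.stdev(simulated)
    if sigma == 0:
        # Every value is identical, so the gene shows no variation at all.
        return 0.0
    return (observed - mu) / sigma


# Hypothetical tag counts for two genes, for illustration only.
constant_gene = permutation_score([50, 52, 48, 51, 49, 50], [51, 49, 50, 52, 48])
changed_gene = permutation_score([10, 12, 11, 9, 10, 11], [40, 42, 39, 41, 43])

print(abs(constant_gene) < abs(changed_gene))
```

A gene with similar expression in both groups scores near zero, while a gene that differs strongly between groups scores far from zero, so ranking genes by the absolute value of the score is one way to shortlist candidate reference genes.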
To demonstrate the utility of NEPS for selecting genes that satisfy the constancy requirement, we analyzed 24 bronchial epithelial lung SAGE libraries, 2 lung parenchyma libraries, and 11 lung squamous cell carcinoma libraries. From this analysis, NEPS selected 15 genes, which we hereafter refer to as the lung-NEPS reference genes (Table ). We further demonstrate that (1) while these genes perform well as reference genes for lung, they are not satisfactory for normalization of expression data from other tissues, suggesting that reference genes are tissue-specific, and (2) in lung cancer datasets, differential gene expression determination and subsequent pathway analyses are improved after normalization using the lung-NEPS reference genes.