Understanding the dynamic mechanism behind a biological process and identifying statistically meaningful changes in the expression levels of selected genes involved in that process require the quantification and comparison of dynamic data. Real-time RT-qPCR is widely used to investigate the relative expression levels of target genes in detail. However, the selection of suitable reference genes poses problems in the analysis of both transient and non-transient expression profiling studies. Quantifying a dynamic response by real-time RT-qPCR requires reference genes that display constant expression across all time points regardless of the genetic or environmental perturbation applied. Although the commonly used reference genes previously reported in the literature [4], [5], [6] constitute a suitable pool for the analysis of non-transient data, these genes were observed to fall short as reference genes in the analysis of dynamic gene expression (data not shown). Therefore, a pool of candidate reference genes needs to be identified for dynamic studies. The most suitable set of reference genes can then be determined from the proposed candidates under the experimental conditions studied.
In this study, high-throughput transient gene expression data were used to identify a set of candidate reference genes for real-time RT-qPCR studies. The reference genes were selected from a large pool of dynamic microarray data sets such that they displayed stable expression profiles across time regardless of the type of experimental condition.
The initial stage of this study required the collection of publicly available time series microarray datasets. Keywords such as “time series” or “time course” proved insufficient to retrieve all the relevant datasets from the database, which underlines the importance of human intervention in acquiring information from electronic sources. The content of each dataset had to be examined carefully by the researcher to evaluate whether it was consistent with the context of the study. Only then could a comprehensive set be attained.
Two different approaches were used to identify a set of candidate reference genes. The first approach used the individual experimental datasets to determine a stability ranking for each set. The frequency with which each gene occurred among the 100 most stable genes served as the first selection criterion, and the genes encountered most frequently across all experiments were identified as candidates. In the second approach, all available data were merged into a single complete dataset, and the overall stability profile based on CV values was used as the second selection criterion. Three genes were identified by both approaches: TDH3, TPI1 and CDC19. Each of the three was selected as a reference gene in at least one of the two case studies.
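The two selection criteria described above can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the gene names and expression values are made up, the function names are hypothetical, the population standard deviation is assumed for the CV, and per-dataset stability is assumed to be ranked by CV as well (the text states this only for the merged dataset). Any cross-dataset normalization that real microarray data would need before merging is omitted.

```python
from collections import Counter
from statistics import mean, pstdev


def cv(values):
    """Coefficient of variation: standard deviation divided by the mean."""
    return pstdev(values) / mean(values)


def rank_by_cv(dataset):
    """Return gene names ordered from most to least stable (lowest CV first)."""
    return sorted(dataset, key=lambda gene: cv(dataset[gene]))


def top_n_frequency(datasets, n=100):
    """Approach 1: count how often each gene is among the n most stable
    genes of the individual datasets."""
    counts = Counter()
    for dataset in datasets:
        counts.update(rank_by_cv(dataset)[:n])
    return counts


def merged_cv_ranking(datasets):
    """Approach 2: pool all time points of each gene across datasets,
    then rank the genes by the CV of the pooled values."""
    merged = {}
    for dataset in datasets:
        for gene, values in dataset.items():
            merged.setdefault(gene, []).extend(values)
    return rank_by_cv(merged)


# Toy, made-up time courses (one dict per dataset, gene -> expression values).
datasets = [
    {"GENE_A": [10.0, 10.1, 9.9], "GENE_B": [5.0, 9.0, 2.0], "GENE_C": [8.0, 8.4, 7.6]},
    {"GENE_A": [10.2, 10.0, 10.1], "GENE_B": [4.0, 4.1, 3.9], "GENE_C": [8.0, 12.0, 3.0]},
]

freq = top_n_frequency(datasets, n=2)      # n=2 only because the toy pool is tiny
merged_rank = merged_cv_ranking(datasets)

# Keep genes that satisfy both criteria: always among the top n per dataset
# and among the top n of the merged ranking.
candidates = [g for g in merged_rank[:2] if freq[g] == len(datasets)]
```

Requiring a gene to pass both criteria mirrors the intersection of the two approaches reported in the text; the threshold used here (present in the top list of every dataset) is an illustrative choice only.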
An advantage of the present strategy for identifying a pool of candidate reference genes is its flexibility. Other approaches can be implemented in a modular manner, so that the candidate gene pool may be extended to meet other specific needs.
The approach in which a combination of datasets was used to identify reference gene candidates through the calculation of CV values showed that the number of datasets included affected the stability order of the genes. This result indicated the need to include as many datasets as possible in the analysis to obtain more reliable results. We therefore attempted to include as many time course experiments conducted with S. cerevisiae as possible in the present study. The strategy is not limited to the currently available datasets; additional datasets can be included, which is advantageous for improving the results obtained with this approach.
The stability analysis of the collected time series datasets revealed that the transcripts of genes that take part in the superpathway of glucose fermentation (FBA1, TPI1, PGK1, CDC19 and PDC1) tended to display stable expression profiles in time course studies. Furthermore, these transcripts were verified experimentally to display stable expression profiles, and thus present a good alternative as reference genes in the normalization of real-time RT-qPCR data. The dominance of fermentation-related genes among the stable reference candidates in time course studies is an interesting result that requires further investigation.
In this study, the geNorm and NormFinder algorithms were used to identify the most stable genes in the candidate list. For this purpose, the results obtained from the two software programs were compared, and the transcripts that displayed the most stable expression profiles in both applications were identified. The stability of the genes was evaluated with a scoring system based on their average stability rankings. However, whether the two programs gave similar results depended strongly on the gene set analyzed, since the geNorm algorithm can be very sensitive to the presence of correlated gene pairs. The results clearly showed that excluding one gene of a correlated pair altered the stability order, and that the stability scores of the transcripts were low only when the correlated genes were excluded from the candidate list. Our approach is based on eliminating correlated gene pairs according to the results of the real-time RT-qPCR experiments under the selected conditions, rather than eliminating a priori the genes reported to be correlated in the literature. This approach made it possible to observe any correlations among the genes investigated in the samples of the current case studies, and avoided the unnecessary exclusion of candidates that could be among the most stable genes.
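The sensitivity of geNorm to correlated gene pairs can be illustrated with a small sketch. It assumes the commonly published definition of the geNorm stability measure M (the mean, over all other candidate genes, of the standard deviation of the log2 expression ratios across samples) and computes a single round only, without geNorm's stepwise exclusion of the least stable gene. The gene names and values are invented for illustration, and `genorm_m` is a hypothetical helper, not the geNorm software itself.

```python
from math import log2
from statistics import mean, stdev


def genorm_m(expr):
    """Single-round geNorm-style stability measure.

    expr maps gene -> list of relative expression quantities, with the
    samples in the same order for every gene. A lower M means the gene's
    ratio to the other candidates is more constant across samples.
    """
    genes = list(expr)
    m_values = {}
    for j in genes:
        pairwise_sd = []
        for k in genes:
            if k == j:
                continue
            log_ratios = [log2(a / b) for a, b in zip(expr[j], expr[k])]
            pairwise_sd.append(stdev(log_ratios))
        m_values[j] = mean(pairwise_sd)
    return m_values


# Made-up data: G1 and G2 vary strongly but in lockstep (co-regulated),
# while G3 and G4 are nearly constant across the time course.
expr = {
    "G1": [1.0, 2.0, 4.0, 2.0],
    "G2": [2.0, 4.0, 8.0, 4.0],
    "G3": [5.0, 5.5, 5.0, 4.5],
    "G4": [3.0, 3.1, 2.9, 3.0],
}

m_all = genorm_m(expr)
best_with_pair = min(m_all, key=m_all.get)

# Remove one member of the correlated pair and recompute.
expr_reduced = {g: v for g, v in expr.items() if g != "G2"}
m_reduced = genorm_m(expr_reduced)
worst_without_pair = max(m_reduced, key=m_reduced.get)
```

In this toy case the co-regulated pair receives the lowest M values while both genes are present, because their mutual ratio is constant. Once one of them is removed, the remaining member becomes the least stable candidate, which is the kind of change in stability order described above.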
This study provided additional evidence that there are no universal reference genes that can suitably be used under all experimental conditions. Here we propose a candidate gene set that can be used for the normalization of dynamic expression profiles of target genes in S. cerevisiae. To confirm the suitability of this candidate gene set, we investigated the transcriptional profiles of the HAP4 and MEP2 genes under two different experimental conditions using both the newly identified reference genes and the commonly used ones. These analyses clearly demonstrated that the newly identified reference genes outperformed the conventional candidates in dynamic expression profiling analysis.
The responses of HAP4 and MEP2 to an impulse-like addition of glucose and ammonium, respectively, have been reported in several studies in the literature [17], [18]. Investigating the expression profiles of genes with well-documented responses to an experimental condition was shown to aid in evaluating the optimum number of reference genes to select from the candidate reference gene set. We therefore suggest evaluating the candidate reference gene set with such control genes of known expression profile before conducting the analyses on the genes of interest.
The reference gene sets were shown to display stable expression profiles across time in both of the cases studied. Moreover, these gene sets outperformed their individual members in terms of stability over time. The genes identified as the least stable by both NormFinder and geNorm were also shown experimentally to vary in their expression profiles, as was 18S rRNA, which was likewise determined to be unsuitable as a reference gene under the stated conditions.
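One way to see why a set of reference genes can be more stable than any single member is sketched below. The sketch assumes that the set is combined into one normalization factor by taking the geometric mean of its members at each time point, which is the convention popularized by geNorm; the text does not state how the sets were combined in this study. All values and names (`REF_1`, `REF_2`, `target`) are invented.

```python
from math import prod
from statistics import mean, pstdev


def cv(values):
    """Coefficient of variation: standard deviation divided by the mean."""
    return pstdev(values) / mean(values)


def geometric_mean_profile(profiles):
    """Combine several reference gene time courses into one normalization
    factor per time point (geometric mean across the genes)."""
    n = len(profiles)
    return [prod(sample) ** (1.0 / n) for sample in zip(*profiles)]


# Made-up relative quantities of two reference genes over four time points.
# Their small fluctuations are not shared, so they partly cancel out.
REF_1 = [1.0, 1.2, 1.0, 0.8]
REF_2 = [1.0, 0.8, 1.0, 1.2]

norm_factor = geometric_mean_profile([REF_1, REF_2])

cv_single = min(cv(REF_1), cv(REF_2))
cv_set = cv(norm_factor)

# Normalizing a made-up target gene profile with the combined factor.
target = [1.0, 2.1, 3.9, 2.0]
normalized_target = [t / f for t, f in zip(target, norm_factor)]
```

The benefit shown here depends on the fluctuations of the members being largely independent. If two reference genes are co-regulated, their variation is shared and the geometric mean does not remove it, which is one more reason to check the candidates for correlated pairs first.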
It should be noted that a different set of reference genes would be selected from the candidate gene pool for each specific experimental condition investigated. The nature of the perturbation or of the experimental setup leads to the identification of different reference genes for each condition throughout the dynamic range of the experiment. In Case Study I, in which the amount of glucose used as the sole carbon source was varied, ACT1 was identified among the most stable reference genes, although it alone was not sufficient for normalization. On the other hand, ACT1 expression was shown to be unstable in another experimental setup for investigating the diauxic shift, in which cells starving for glucose switched to utilizing other carbon sources such as ethanol [4]. Yet another study used ACT1 as a reference for monitoring the expression levels of glycolytic genes in response to a switch from growth on ethanol to growth on glucose [6].
Several methodologies have previously been used to select RT-qPCR reference genes. One of the most common strategies is to select one or more of the reference genes that are frequently cited in the literature [3], [5], [10], [20], [21], [33], [34], [35]. Other approaches have been used less often. In a study validating reference genes for quantitative expression analysis by real-time RT-qPCR in S. cerevisiae, Teste et al. selected microarray datasets whose reported culture conditions were closest to their own experimental setup. The potential reference genes were selected from the transcripts displaying stable expression profiles in these microarray datasets, and traditionally used reference genes were also included in the study [4]. Another study focused on systematically collecting microarray data for selecting reference genes under a specific set of conditions [1]. In yet another study, the candidate reference genes were determined via the stability of their expression levels across different tissues using their EST profiles [3]. The existence of a pool of commonly used reference genes has proven very useful in time-invariant RT-qPCR analysis. However, these reference genes were often found to display far from stable expression profiles in dynamic studies, compelling researchers to seek alternative ways to identify novel reference gene candidates [4]. Unfortunately, currently utilized approaches usually fail to identify a universal set of candidate reference genes for dynamic studies regardless of experimental conditions. Our approach may help serve this purpose by gathering dynamic microarray datasets with experimental conditions as diverse as possible to identify a universal pool of candidate reference genes, from which users may select subsets suitable for their particular needs.
The present approach gives its users considerable room to manoeuvre with the proposed candidate reference genes. Although the specific cases presented in this study focused on transient changes in the amount of available glucose or ammonium in the fermentation medium in yeast, suitable reference genes could be selected from the candidate pool for completely different experimental conditions, regardless of whether such an experimental setup was previously analyzed. The candidate pool itself was created from a collection of dynamic datasets covering a diverse set of experimental conditions. These included environmental variations such as osmotic stress, heat shock, cell cycle, DNA damage, nutrient availability, chemical treatments, desiccation stress and nitrosative stress, as well as genetic mutations such as the deletion and overexpression of genes (Table S1). The genes in the candidate pool were identified such that the expression profile of each candidate was stable regardless of the diversity of the experimental conditions under which the microarray data were generated. This strength of the selection method increases the likelihood of finding, within the pool of candidates, reference genes with stable expression profiles across time in a specific experimental condition that was not previously analyzed.
It can be concluded that the pool of candidate reference genes determined in this study (TPI1, ACT1, TDH3, FBA1, CCW12, CDC19, ADH1, PGK1, GCN4, PDC1, RPS26A and ARF1) may be used to identify the set of suitable reference genes in the analysis of dynamic transcriptional data by real-time RT-qPCR in S. cerevisiae under different experimental conditions. However, this candidate gene set may be improved as the number of publicly available time course microarray datasets increases. The flexibility of the methods used in this approach enables the inclusion of additional datasets and thus the continual updating of the candidate pool. Additional methods could also be implemented in a modular manner to enhance the results obtained from this study. This study showed the significance of the researcher's intervention at various stages of reference gene selection, both during the inclusion of new dynamic datasets for determining the candidate set and during the exclusion of correlated gene pairs from the candidate pool while identifying the most stable genes.