In the field of functional genomics, transcriptome analysis has always played a central role for unraveling the complexity of gene expression regulation. After decades of extensive investigations based on the characterization of genome-wide gene expression through oligonucleotide-based array technologies, transcriptomics has gained new momentum, thanks to the advent of Next Generation Sequencing (NGS). NGS has enabled high-throughput of nucleic acid molecule sequencing such as DNA (DNA-seq) and RNA (RNA-seq) (1
). The establishment of RNA-seq as an attractive analytical tool in trancriptomics, led to a fast development of this technology, decreasing the running cost and offering the possibility to uncover novel transcriptional-related events. Compared with hybridization-based transcriptome studies, where only difference in expression of the ORFs can be addressed, RNA-seq allows to analyze genome-wide transcription, thus providing additional features such as, analysis of novel transcripts, smRNA, miRNA and alternative splicing events. Furthermore, RNA-seq allows the analysis of transcribed but non-translated regions that may act in regulating gene expression, e.g. UTR (2
). Other advantages of RNA-seq compared with microarrays are its high resolution, better dynamic range of detection and lower technical variation (3
). Nevertheless, microarrays represent a well established technology and have been widely used in the last decades, leading to availability of extensive information. More than 900 000 published microarray assays are available in repository databases like Gene Expression Omnibus or ArrayExpress and have been shared within the research community.
To date, several studies comparing RNA-seq and hybridization arrays have been performed. Comparison between the two techniques have been reported in Candida parapsilolis
), Candida albicans
), on the fission yeast Schizosaccharomyces pombe
), Drosophila melanogaster
), Caenorhabditis elegans
), in mice tissues (8
) and in several human cells and cell lines (5
). Several studies based on RNA-seq analysis of the well known eukaryotic model microorganism Saccharomyces cerevisiae,
have been performed (16–20
) and evaluation of the performances of different library construction methods for RNA-seq was also addressed using S. cerevisiae
as a model organism (19
). The reported correlations between microarrays and RNA-seq in detecting normalized expression signal are in different ranges (1
), suggesting possible inconsistency of different processing methods. Higher correlation is overall observed in differential gene expression (DGE) analysis; however, up to date, a comprehensive description of the performances of RNA-seq data in detecting DGE has not been addressed in detail.
There are two major approaches to process RNA-seq data from short reads in order to identify DGEs (21
). With the first approach, which is the most widely used in RNA-seq analysis, reads are mapped onto a reference genome (22
) and the results of gene expression level are dependent on the aligner used in the analysis. Recently, different aligners and algorithm for RNA-seq analysis were compared, based on their mapping quality and splice junctions (24
). The second approach is de novo
assembly of the short reads (25–27
) that does not require a reference genome. Recently, the performances of different transcriptome assemblers have been compared, based on their capability to identify full-length transcripts and on computational demand (28
), however, statistical analysis for DGE identification and comparison between the two approaches was not covered.
In recent years, many statistical methods have been developed to identify DGE through different statistical models based on discrete probability distribution. The edgeR method proposed by Robinson et al.
) has been developed based on an overdispersed Poisson model to explain the variation in the read count data, then the evaluation of the differences across transcripts are estimated using Empirical Bayes method. Trapnell et al.
) presented the Cuffdiff method that relies on beta negative binomial model to estimate the variance of the RNA-seq data for DGE analysis by t
-like statistics from FPKM values. In addition, different transcript isoforms can also be evaluated for their differential expression using Jensen Shannon entropy. Anders and Huber (30
) showed that negative binomial was superior for estimation of variability in read count type data and implemented the method as a DESeq package showing better results in DGE identification, when compared their method with edgeR. Following this, Hardcastle and Kelly (31
) proposed another algorithm to identify DGE from a count data based on the combination of negative binomial distribution and Empirical Bayes approach to estimate posterior probability of DGE. This method also provides the ability to analyze complex experimental setups that can be useful for several biological applications. Last, Tarazona et al.
) proposed the NOISeq method based on non-parametric statistics and empirical models on the noise distribution of count data, and this method was shown to be non-sensitive to the sequencing depth of the data. In addition, this method also has a better control of false discoveries. The development of several statistical methods indicates the maturity in using RNA-seq data for transcriptomics. However, a thorough comparison of DGE analysis among different methods is required in order to increase the understanding of the different steps involved in the analysis of RNA-seq data.
We thus undertook a study with the objective to evaluate the contribution of different factors affecting the detection of gene expression levels during the several steps involved in analysis of RNA-seq data and compare the capability of different statistical methods to capture DGE. provides an overview of our study. We performed RNA-seq data from the cultivations of S. cerevisisae
under two different metabolic conditions. For each condition, in parallel, we also performed traditional transcriptome analysis based on the microarray platform and, additionally, we sequenced the genome (DNA-seq) of the strain from the same initial culture in order to detect eventual genetic variance such as single nucleotide variations (SNVs) and insertions–deletions (indels). Whereas previous genome-wide transcriptomic studies using RNA-seq of S. cerevisiae
were based on the reference strain S288c, we based our analysis on the widely used laboratory strain CEN.PK 113-7D, as this allowed us to further investigate the influence of genetic variation on the gene expression levels estimations using the different methods. We first address the impact of different aligners in detecting DGE: Gsnap (33
), Stampy (34
) and TopHat (35
) and successively evaluate the impact of using five different statistical methods: (i) baySeq (31
), (ii) Cuffdiff (23
), (iii) DESeq (30
), (iv) edgeR (25
) and (v) NOISeq (32
). Additionally, we compared the results obtained with the ‘reference genome method’ with the de novo
assembly using Trinity pipeline (26
). To allow for visualization of the detected transcripts, we provide a versatile transcriptome browser that presents the results generated within this study, and integrates previously published RNA-seq data. This transcriptome browser may serve as a platform for future RNA-seq-based transcriptome analysis of S. cerevisiae
Figure 1. Study design overview. The same initial culture of S. cerevisiae strain CEN.PK-113-7D was used for DNA-seq (gray line) and transcriptome analysis to reduce technical variation and polymorphism. The strain was cultivated under two different metabolic conditions, (more ...)