As a method for characterizing global changes in transcription, RNA-Seq is an attractive option because of the ability to quantify differences in mRNA abundance in response to various treatments and diseases, as well as to detect alternative splice variants and novel transcripts
]. Compared to microarray techniques, RNA-Seq eliminates the need for prior species-specific sequence information and overcomes the limitation of detecting low abundance transcripts. In addition, early studies have demonstrated that RNA-Seq is very reliable in terms of technical reproducibility
]. As a result, biologists studying an array of model and non-model organisms are beginning to utilize RNA-Seq analysis with ever growing frequency
]. However, without experience using bioinformatics methods, the large number of choices available to analyze differential expression can be overwhelming for the bench scientist (see Table one in
Essentially, RNA-Seq consists of five distinct phases, 1) RNA isolation, 2) library preparation, 3) sequencing-by-synthesis, 4) mapping of raw reads to a reference transcriptome or genome and 5) determining significance for differential gene expression (for review see
]). In an effort to familiarize the bench scientist with the post-sequencing analysis of RNA-Seq data (phase 5), we have developed an analysis strategy based on currently available bioinformatics tools. Here, we compare three statistical tools used to analyze differential gene expression: edgeR, DESeq and Limma
]. Based on their performance, we present an analysis strategy that combines these tools in order to generate an optimized list of genes that are differentially expressed. Finally, we highlight several aspects of RNA-Seq analysis that have the potential to lead to misleading conclusions and discuss options to minimize these pitfalls.
While the aim of this paper is to familiarize the molecular biologist interested in undertaking an RNA-Seq project with the methods and issues related to post-sequencing analysis, emphasis still needs to be placed on proper handling of RNA samples. Here, we isolated high quality RNA (Additional file
) using a well-established protocol for soybean leaf tissue
]. In addition, care was taken during the library construction and sequencing-by-synthesis phases, as evidenced by the high quality scores for each sample (Table
). As a result, the average number of usable reads per sample was >20 million, which is the recommended depth required to quantify differential expression in a species with a referenced genome
It is also important to utilize a valid experimental design for RNA-Seq projects, which includes the use of biological replicates. Reports demonstrating highly reproducible RNA-Seq results
] make it tempting to reduce sequencing costs by only using one replicate per treatment group. However, without replication it is impossible to estimate error, without which there is no basis for statistical inference
]. Therefore, it is recommended that RNA-Seq experiments include at least three biological replicates per treatment group
], as was done in the experiment presented here.
Along these lines, it is important to understand the nature of RNA-Seq data and why it is necessary to use a compatible statistical method, such as a negative binomial distribution
]. For discrete variables such as count data, it is possible to associate all observed values with a non-zero probability. In contrast, there is zero probability that a specific fluorescence value (continuous variable) will be obtained from microarray hybridization. This distinction is important in the context of the varying number of total reads obtained for individual RNA-Seq samples. For example, the probability of mapping 100 reads out of 16.86 million (Table
; Sample3) for a particular gene is different than mapping 100 reads out of 36.41 million (Table
; Sample1). To deal with this issue, both edgeR
] and DESeq
] normalize the read data based on the total number of reads per sample prior to differential expression analysis.
The main goal of this work was to compare the accuracy of two statistical tools, edgeR and DEseq. At first glance, it appears that both tools perform equally well (Figure
, Step B). However, when the differentially expressed genes from edgeR and DEseq were intersected (Figure
, Step C), quite a few genes from each list were eliminated (2,242 total genes). Because of this, we adopted a strategy to identify genes that were determined to be differentially expressed by both edgeR and DESeq. In other words, greater confidence was achieved if a gene was determined significant by each of the statistical tools.
This strategy made it possible to follow the genes that were eliminated and to identify aspects of the analysis that have the potential to lead to erroneous conclusions. One aspect to consider is how each of the different statistical tools is designed to handle and report ‘zero reads’ or transcripts that are not expressed in a given treatment. For example, DESeq will output 'Inf' or '-Inf' to excel as the log2
fold change value for genes that fail to align any reads for all control or treatment samples (Table
). In contrast, edgeR outputs log2
fold changes values that are unrealistically large. It is possible that some of these genes could reveal important aspects of global transcription that were altered (i.e., genes that were turned on or off by the treatment) and should not be inadvertently removed. In many cases, however, these genes had very few reads for each replicate as well as for each treatment (Table
). Transcript abundance this low, while determined to be significantly different, is unlikely to be biologically relevant and should be removed from the analysis. Care should be taken when choosing an arbitrary cutoff, however, to prevent the elimination of genes that may play a transcriptional role in response to the treatment being investigated. In this case, we used a conservative RPKM value <1.0 that resulted in the removal of 1,608 low abundance genes (Figure
, Step D).
Expression data for low abundance genes
Another aspect that has the potential to confound RNA-Seq analysis deals with the issue of statistical stringency. In Table
, we demonstrated that for several functional categories, the marginally significant genes eliminated from the optimized list did, in fact, respond to elevated ozone in a manner similar to the genes ultimately determined to be differentially expressed. Therefore, it may be more appropriate to perform network analysis for individual metabolic or signal transduction pathways using the entire RNA-Seq dataset, not just the genes determined to be differentially expressed
]. However, this strategy is limited by pathways that have been previously characterized and would fail to uncover new connections, especially unknown signalling relationships.
One final issue revealed by this analysis was the increase in type II error for high abundance genes (Table
). Several of the genes determined not to be differentially regulated by one or both of the statistical tools are involved with processes that have been well characterized to be regulated to elevated ozone, including decreased photosynthesis (Glyma05g25810 and Glyma17g37280)
], increased antioxidant capacity (Glyma11g11460)
] and increased protein turnover (Glyma20g27950)
]. However, these genes were determined to be differentially expressed based on statistical analysis of RPKM values. This problem undermines the effectiveness of performing RNA-Seq analysis to uncover novel relationships because it will fail to identify all of the high abundance genes that are differentially regulated in response to elevated ozone; genes that are more likely to impact biological processes, especially metabolic functions.