Analysing RNA-Seq data for gene expression has traditionally required genomic resources for the species of interest in order to map and annotate reads. Greater sequencing depth and read length, more advanced assembly software [6
], and, most importantly, lower costs now make RNA-Seq an attractive alternative to designing and using custom microarrays for researchers wanting to study the transcriptomes of species that lack genomic resources. For such non-model species, one route to using RNA-Seq for expression insights is to perform de novo
transcriptome assembly and use this assembly as a scaffold for quantitative RNA-Seq mapping. While this has been done using the 454 platform (e.g. [9
]), the small number of reads typically provided per run (ca. 1 × 10⁶
) makes this approach perhaps accurate only for the most highly expressed genes. Currently in the literature there is much discussion about how many RNA-Seq reads are necessary to generate repeatable quantitative measures for moderately to lowly expressed genes, with emerging empirical results suggesting at least 10 to 30 million reads are necessary ([25
], but see [36
]). This strongly suggests that using the Illumina platform, which can provide two orders of magnitude more reads for less than half the cost of the current 454 technology, is the best way forward for quantitative expression analysis. Thus, here we have assessed the performance of Illumina sequencing data in the non-model species context.
To date, only a handful of studies have applied the Illumina approach for quantitative RNA-Seq expression analysis in non-model species. In their investigation of the transcriptome profiles of parasitized vs. non-parasitized Plutella xylostella
moths, Etebari and colleagues assembled all of their Illumina reads into a de novo
transcriptome and subsequently used this as a scaffold for mapping their differently parasitized groups. Annotation of the P. xylostella
transcriptome used a BLAST search in NCBI [18
], allowing them to identify differential expression of many metabolic and immune genes. A different study developed a pipeline that facilitates the assembly and annotation of non-model species transcriptomes through utilisation of the genomic resources of related organisms. This method allowed the researchers to perform expression profiling and also to increase the quantity and quality of sequence data available for their targeted species, the Chinese hamster [17
].
The current study was motivated by questioning the general accuracy of the de novo
approach exemplified by these two studies. While their conclusions are well justified, here we have worked to attain a deeper understanding of the potential errors and biases that might underlie such analyses. One major concern is the level of bias in the genes that are finally included in the analysis. Genes assembled and annotated are not likely to be a random sample of the genome, since highly expressed genes will likely be assembled and annotated best. Etebari et al. [18
] found no bias in GO terms between their de novo
transcriptome assembled for P. xylostella
compared to the silkworm Bombyx mori
, which diverged approximately 120 million years ago (Wheat & Wahlberg, in review). However, the GO terms that are liable to be shared between these two Lepidoptera are themselves likely to be highly biased, as the greatest number of B. mori
genes that have annotations are those shared with the genomic reference species Drosophila melanogaster
. Given the deep divergence of D. melanogaster
from B. mori
, which last shared an ancestor approximately 330 million years ago (Wheat & Wahlberg, in review), the only genes likely to be functionally annotated are those with highly conserved function and constrained evolutionary dynamics. Such housekeeping genes are thus very likely to be those highly expressed in P. xylostella
, and hence to appear unbiased with respect to the annotated genes in B. mori
. The Birzele et al. [17
] study used Velvet to assemble their transcriptome. This is of concern given a recent comparative assessment of transcriptome assembly software packages, which found Velvet to be among the worst-performing programs on their transcriptome dataset [37
]. We therefore wished to use an assembler which had previously been shown to perform well [37
] and so chose the CLC method.
In order to assess the performance of RNA-Seq, we addressed several steps of this approach, beginning with de novo
transcriptome assembly. The effect of sequencing technology and volume of data upon the quality of the de novo
TA was assessed by comparing 8 different TAs using RNA-Seq data for H. sapiens
. In comparison to previous examinations of assembly performance that have used simulated data [37
], using data from H. sapiens
gave us a unique combination of genomic insights and real world data. We find, as have others, that the standard metrics commonly applied may lead to a sub-optimal choice of TA [37
]. While metrics solely based on contig lengths suggested that TAs composed of only Illumina reads performed best, in-depth investigation using genomic resources showed a different picture. Using BLAST to identify putative orthologous relationships between TA contigs and the predicted gene set of humans, the aligned region between the two could be determined. Dividing this aligned length by the full length of the predicted gene provides a ratio indicating how much of the coding region a TA contig has successfully reconstructed. Here we have used this approach to calculate the total amount of each predicted gene covered by the TA contigs in a given assembly, and refer to this as the contig reference ratio (CRR). After such a comparison between TA contigs and their putative orthologs appeared in the first de novo
transcriptome assembly [12
], similar ratios have been developed (e.g. [39
]). We find the three most informative ratios are for: 1) all possible TA contigs (all CRR), 2) the longest TA contig per ortholog (longest CRR), and 3) the sum of the ortholog length covered by all the TA contigs, which is then divided by the full length of the ortholog (sum CRR). While informative, all CRR, which was used by O’Neil et al. 2010 [39
], inflates apparent TA performance, since several contigs for the same gene may be quantified while providing no information as to the total amount of each ortholog a given TA covers. Longest CRR is perhaps the single best metric for assessing TA performance: ideally, the best TA would predict single contigs that covered the full length of each transcript, as well as the different isoforms, without any over-prediction. Here we have used the longest CRR only once, for assessing our TAs (Additional file 2
: Figure S1
) and this provided very similar insights to those of our sum CRR results. Throughout our paper we have almost exclusively used sum CRR, because we wish to know how much of each gene our TA covers: maximizing coverage is necessary to generate a good scaffold for mapping the RNA-Seq data, and this information is combined on a per-gene basis.
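As an illustration of how these ratios are computed, the sketch below derives all CRR, longest CRR and sum CRR from BLAST-style alignment coordinates. The data structures and function names are ours for illustration, not part of the original pipeline; sum CRR merges overlapping contig alignments before dividing by the ortholog length.

```python
from collections import defaultdict

def merge_intervals(intervals):
    """Merge overlapping [start, end) intervals on a reference gene."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged

def crr_metrics(hits, gene_lengths):
    """hits: (contig_id, gene_id, aln_start, aln_end) tuples (illustrative).
    Returns three per-gene dicts: all CRR (one ratio per contig),
    longest CRR (best single contig), sum CRR (merged total coverage)."""
    per_gene = defaultdict(list)
    for contig, gene, start, end in hits:
        per_gene[gene].append((start, end))
    all_crr, longest_crr, sum_crr = {}, {}, {}
    for gene, ivals in per_gene.items():
        length = gene_lengths[gene]
        ratios = [(e - s) / length for s, e in ivals]
        all_crr[gene] = ratios
        longest_crr[gene] = max(ratios)
        covered = sum(e - s for s, e in merge_intervals(ivals))
        sum_crr[gene] = covered / length
    return all_crr, longest_crr, sum_crr
```

For example, two contigs covering positions 0–600 and 400–1000 of a 1000 bp gene each give a longest CRR of 0.6, while their merged coverage gives a sum CRR of 1.0, which is why sum CRR best reflects total scaffold coverage.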
Availability of the CCDS predicted gene set allowed us to ascertain the level of TA coverage for each gene. Although pure Illumina-based TAs had fewer and longer contigs than both the pure 454 TA and the hybrid TAs (composed of both Illumina and 454 reads), the pure Illumina TAs also had a much lower coverage of the transcriptome compared to the hybrid assemblies, at both the individual gene and total transcriptome level.
Using the hybrid TA as a scaffold produced RNA-Seq mapping results that were similar to directly mapping to the CCDS predicted gene set from the H. sapiens
genome. Although there were approximately 2000 fewer genes mapped when using the TA as compared to the CCDS gene set, the correlation in whole gene expression values between these two methods was extremely high (Spearman’s ρ = 0.94). Similar high levels of correlation were observed across technical replicates when mapped to the hybrid TA assembly. While additional improvements could be made in the de novo
approach, the correlation between the two approaches is already much higher than that observed in comparisons between RNA-Seq and microarray results (Spearman’s ρ = 0.73; [40
]). Thus, hybrid TA assemblies, combining Illumina and 454 reads, emerge as the best assemblies and scaffolds for RNA-Seq mapping. We should note that this study utilised the Illumina RNA-Seq data available at the time; sequencing technology is advancing at a rapid rate, and the quality of de novo
transcriptomes that can be assembled with the latest sequencing data (e.g. Illumina’s HiSeq 2000: http://www.illumina.com/systems/hiseq_2000.ilmn
) will likely surpass what we show here. Thus this study should be taken as a comparative study, and a conservative guide.
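For readers reproducing such comparisons, Spearman's ρ is simply the Pearson correlation of the ranked expression values. A minimal pure-Python sketch follows; in practice one would typically use scipy.stats.spearmanr.

```python
def rank(values):
    """Average ranks (1-based); tied values share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    rx, ry = rank(x), rank(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

Because it operates on ranks, this statistic is robust to the monotonic distortions of expression scale that differ between mapping scaffolds, which is why it suits the TA-versus-CCDS comparison above.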
After mapping RNA-Seq reads to a TA (whether it be a de novo
assembled one, or a transcriptome already available for the species), the contigs need to be assigned to genes for grouping and functional annotation. In non-model species, the ability to obtain significant biological insights into gene expression variation is limited by gene functional annotation. In model systems, pipelines for obtaining gene expression values and assigning them to a growing number of genes of known or likely biological function are well developed. Non-model systems must necessarily tap into this reservoir of data using BLAST, assumptions of gene function homology, and the genome or gene set of a related but potentially very divergent species. While this approach can be successful (e.g. [12
]), what effect does increasing evolutionary distance from the focal species have upon functional insights?
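A minimal sketch of such homology-based assignment is given below, assuming standard BLAST+ tabular output (-outfmt 6) with its default twelve columns; the threshold values are illustrative, not those used in this study.

```python
def assign_contigs(blast_lines, min_identity=50.0, max_evalue=1e-10):
    """Assign each TA contig to the gene of its best-scoring BLAST hit.
    blast_lines: iterable of -outfmt 6 rows (qseqid sseqid pident length
    mismatch gapopen qstart qend sstart send evalue bitscore).
    Returns {contig_id: reference_gene_id}. Thresholds are illustrative."""
    best = {}  # contig -> (bitscore, gene)
    for line in blast_lines:
        f = line.rstrip("\n").split("\t")
        contig, gene = f[0], f[1]
        identity, evalue, bitscore = float(f[2]), float(f[10]), float(f[11])
        if identity < min_identity or evalue > max_evalue:
            continue  # discard weak or low-identity hits
        if contig not in best or bitscore > best[contig][0]:
            best[contig] = (bitscore, gene)
    return {c: g for c, (s, g) in best.items()}
```

Raising min_identity corresponds to the more stringent BLASTx parameters discussed later in this section: fewer mis-assigned contigs, at the cost of fewer contigs annotated.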
Gibbons and colleagues investigated the accuracy of ortholog prediction between increasingly divergent species, using RNA-Seq data from one species and genomic proteome data from a genomic reference species (GRS) [41
]. They observed a decrease in the number of successfully identified orthologs (contig/GRS gene pairs) with increasing divergence. Their study spanned a time-frame of 300 million years from the target species, with the two youngest GRS being 40 and 150 million years divergent from the focal species. Here we sought to investigate whether a negative effect of divergence is observed within the 0–150 million year time frame. Although we expected that as evolutionary distance increased between the GRS and H. sapiens
there would be a significant decrease in the number of genes in the GRS gene sets that found a hit in the de novo
TA, this effect was weak up to 100 million years of divergence (Figure a). A similar effect was observed when assessing the amount of each gene that was covered by TA contigs. Thus there is little negative effect of using a GRS as the reference dataset for grouping contigs as divergence increases up to 100 million years, although beyond this age the number of genes assigned good coverage decreases.
These results also suggest that the use of proxy GRS up to 100 million years divergent from H. sapiens
for grouping TA contigs might result in only a small bias on expression levels compared to directly mapping RNA-Seq reads to the CCDS dataset. Within this time frame proxy GRS are also likely to enable successful measurement of expression levels as a high correlation in expression between these two methods was found in all cases, even when comparing Mus musculus
(mouse), which is c. 75–91 million years divergent from H. sapiens
(Figure d; [30
]). However, expression results using these divergent species as proxy references also suffered from a level of incorrect assignment of TA contigs to genes, and this negative effect was found to increase with evolutionary distance (Figure c). While this effect was moderate using GRS up to 100 million years divergent, comparisons using the platypus as the GRS showed both a dramatic increase in incorrectly assigned TA contigs and a lower correlation with the CCDS-mapped expression values (Figure b). This identified error was mirrored in the gene set enrichment analyses, as incorrect contig assignment should be greatest for genes with higher evolutionary rates, or conversely, lowest for constrained genes (Figure c). Error is likely to accrue during the BLASTx search of the TA contigs against the GRS dataset; indeed, when this BLASTx was repeated for two of the GRS using more stringent identity parameters, less error was encountered. A repercussion of this, however, was reduced coverage of the transcriptome in terms of genes that could be annotated.
Several important limitations of our approach should be noted. First, there are certainly many species that do not have a genomic reference species within 100 million years of divergence. While our approach would certainly aid such projects, researchers should be aware of the error and bias inherent in such analyses. Fortunately, as the genomics era progresses, the number of available genomic reference species will increase. Second, a large class of genes will lack homology between species, and this class will grow with divergence. Such orphan genes are likely to be involved in species-specific adaptations and potentially the most ecologically and evolutionarily interesting aspects of a species' transcriptome (see [42] for a review). Therefore, there is a high likelihood that insightful results reside in careful analysis of the part of the transcriptome that does not hit a proxy reference genome and for which no known biological function is established. Our analyses suggest that de novo
analysis of orphan genes will be most insightful when such genes are assembled to their full length. Careful examination of TA contigs for long open reading frames flanked by both 5’ and 3’ UTRs may prove useful for such assessment (e.g. looking for the polyadenylation signal). Overcoming the bias against studying orphan genes is a challenge facing the entire research community.
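Such an assessment could be sketched as follows; the ORF-length and UTR-length thresholds, the forward-strand-only scan, and the single canonical AATAAA motif are simplifying assumptions for illustration.

```python
import re

def longest_orf(seq):
    """Longest ATG..stop ORF across the three forward reading frames.
    Returns (start, end) nucleotide coordinates, or None if no ORF."""
    best = None
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon == "ATG" and start is None:
                start = i
            elif codon in ("TAA", "TAG", "TGA") and start is not None:
                if best is None or (i + 3 - start) > (best[1] - best[0]):
                    best = (start, i + 3)
                start = None
    return best

def looks_full_length(seq, min_orf=300, utr_min=20):
    """Heuristic for a full-length transcript: a long ORF, flanking
    sequence on both sides (a proxy for 5' and 3' UTRs), and an AATAAA
    polyadenylation signal downstream of the stop codon."""
    orf = longest_orf(seq)
    if orf is None or orf[1] - orf[0] < min_orf:
        return False
    has_utrs = orf[0] >= utr_min and len(seq) - orf[1] >= utr_min
    has_polya = re.search("AATAAA", seq[orf[1]:]) is not None
    return has_utrs and has_polya
```

A production version would also scan the reverse complement and allow alternative polyadenylation signals, but even this crude filter separates plausibly complete orphan contigs from fragments.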
A third limitation arises due to variation in recent gene duplication events among individuals, commonly referred to as Copy Number Variants (CNV). When young, CNVs can be very difficult to detect in RNA-Seq data. When mapping RNA-Seq reads back to a full genome, which is usually derived from a single individual, differences among individuals in their CNV with reference to the genome can result in reads from several independent loci being mapped to a single locus, resulting in a spuriously inflated measure of single locus expression. This is certainly the case in the de novo
approach we use here to obtain whole gene expression levels, as the contigs we assign to the same gene may derive from incomplete transcript assembly as well as from recent duplication events. Expression differences detected between biological groups using this whole gene approach must necessarily be studied in more detail, to assess the causal basis of the signal. We argue that this holds in model and non-model systems alike, wherever there are likely to be significant differences between the scaffold genome and the individuals having their RNA sequenced. Finally, this is similarly true for splicing isoforms, as our whole gene approach pools expression across exons for the entire gene. To the extent that differences among groups arise from expression differences solely in specific exons, the resulting whole gene differences must be investigated further to determine whether they are evenly distributed across the entire gene. A last point of importance concerns the choice of transcriptome assembler. Many papers are still emerging in which groups have used poorly performing assembly software to assemble their transcriptome data. Our results might not be obtainable with such software, especially as few programs handle hybrid data well. Thus we encourage researchers to be aware of the latest advances in transcriptome assembly and to use methods shown to perform well on the kind of data they have generated [37
]. In sum, our whole gene expression quantification provides a robust starting place for the identification of gene expression differences whose biological basis will require more detailed study, as should be common in any RNA-Seq study regardless of genomic resources.
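The whole gene quantification used throughout can be sketched as a simple pooling step (names illustrative); contigs lacking a gene assignment are kept under their own identifiers so that orphan contigs are not silently dropped.

```python
from collections import defaultdict

def gene_expression(contig_counts, contig_to_gene):
    """Pool per-contig read counts into whole gene expression values.
    contig_counts: {contig_id: mapped read count};
    contig_to_gene: {contig_id: gene_id} from homology-based assignment.
    Unassigned contigs are reported under their own id."""
    totals = defaultdict(int)
    for contig, count in contig_counts.items():
        totals[contig_to_gene.get(contig, contig)] += count
    return dict(totals)
```

Note that this pooling is exactly what makes CNV- or isoform-driven signals indistinguishable from whole gene expression changes, which is why significant genes warrant the follow-up scrutiny described above.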