There are now a variety of off-the-shelf workflows for analyzing gene expression with next-generation sequencing tools. These workflows are multi-step processes that differ in sample preparation, sequencing methods, and mapping tools. The direct comparison presented here provides the opportunity to examine the degree of congruence between three very different workflows and explore sources of incongruence. We chose workflows that provide the broadest perspective of possible differences, which is critical for comparing studies that use different methods and for selecting workflows for particular applications, at the expense of being able to unequivocally attribute all differences to particular steps within each workflow.
The preparation of our samples for Helicos sequencing gave atypically low cDNA yields and resulted in ectopic tag sites that were frequently displaced toward the 3′ end. The exact reason for this unexpected outcome for these particular samples could not be determined. The results presented for Helicos here are therefore suboptimal and not typical of the performance of this workflow. Despite the problems with sample preparation, however, the Helicos DGE results are still consistent across replicates and congruent with the other two workflows. The data are therefore quite robust to the problems encountered during sample preparation. The detection of ectopic tag sites indicates that it is important to check the physical distribution of mapped reads to verify that library preparation generated the expected products.
Comparative evaluation of read allocation across workflows suggests overrepresentation of highly expressed genes by the SOLiD SAGE workflow. This could be due to overamplification in this particular sample set, though the use of only eight cycles of amplification suggests that other factors may be at play. The sample preparation kit has since been upgraded and the results presented here might not be indicative of current performance.
The application presented here is typical of that faced by investigators working on non-model organisms – specimens were collected in the field, and the gene reference sequences that were generated are incomplete. We found that the incompleteness of reference sequences explained the greatest fraction of differences between workflows in the ability to detect differential expression (DE). Improving reference completeness is critical to optimizing DE assessment.
The ratio between mapped reads and the total number of reads might serve as a rough indicator of the completeness of the transcriptome assembly. In this study, 26.8% of the Illumina mRNA-Seq reads that passed the filter map uniquely to the gene reference, and 26.7% of the reads that passed the filter are derived from ribosomal RNA. This leaves 46.5% of reads that do not map uniquely or do not map at all. Reads that do not map at all could be due to several causes, including genetic polymorphism between specimens resulting in multiple mismatches, sequencing errors, genes missing from the reference, and portions of genes missing from the reference (i.e., incomplete gene sequences). However, the fact that highly expressed genes contribute proportionally more reads to the pool of mapped reads complicates the interpretation of the ratio.
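The bookkeeping behind these percentages can be sketched in a few lines of Python. This is only an illustration: the function name and the read counts are hypothetical, and the counts were chosen solely to reproduce the percentages reported above, not taken from the actual data.

```python
def read_fate_fractions(total_filtered, uniquely_mapped, ribosomal):
    """Split filtered reads into three fractions of the total.

    All arguments are read counts. The remainder covers reads that
    map ambiguously or do not map at all.
    """
    if total_filtered <= 0:
        raise ValueError("total_filtered must be positive")
    if uniquely_mapped < 0 or ribosomal < 0:
        raise ValueError("read counts must be non-negative")
    if uniquely_mapped + ribosomal > total_filtered:
        raise ValueError("categories exceed the total number of reads")

    remainder = total_filtered - uniquely_mapped - ribosomal
    return {
        "uniquely_mapped": uniquely_mapped / total_filtered,
        "ribosomal": ribosomal / total_filtered,
        "ambiguous_or_unmapped": remainder / total_filtered,
    }


# Hypothetical counts (assumption): 1000 filtered reads, of which
# 268 map uniquely and 267 are ribosomal.
fractions = read_fate_fractions(1000, 268, 267)
for fate, fraction in fractions.items():
    print(f"{fate}: {100 * fraction:.1f}%")
```

Because the uniquely mapped fraction is dominated by reads from highly expressed genes, a value computed this way is at best a coarse proxy for how many genes the reference covers.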
The Illumina mRNA-Seq workflow was the least sensitive to gene reference sequence completeness, and identified the greatest number of reference sequences with DE. In this study the tag-based protocols (Helicos DGE and SOLiD SAGE) detected DE for about half as many reference sequences. When only the subset of reference sequences that unambiguously include the 3′-most NlaIII site were considered, congruence across platforms was much greater and they all identified a similar set of genes with DE.
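A toy example shows why tag-based workflows depend on the 3′ end of each reference sequence being present. NlaIII recognizes CATG, and the tag is anchored at the 3′-most such site in the transcript. The sketch below is not the procedure used in this study to decide which reference sequences unambiguously include that site; the function name and sequences are invented for illustration.

```python
NLAIII_SITE = "CATG"  # NlaIII recognition sequence


def three_prime_most_site(sequence):
    """Return the 0-based start of the 3'-most CATG, or None if absent."""
    index = sequence.upper().rfind(NLAIII_SITE)
    return index if index >= 0 else None


# Invented full-length transcript (assumption) with two NlaIII sites.
full_transcript = "AACATGTTTTCATGGGCCAAA"
# The same transcript as it might appear in an incomplete reference
# that lacks the 3' end (first 10 bases only).
truncated_reference = full_transcript[:10]

true_site = three_prime_most_site(full_transcript)
apparent_site = three_prime_most_site(truncated_reference)

print("3'-most site in full transcript:", true_site)
print("3'-most site in truncated reference:", apparent_site)
```

In the truncated reference the apparent 3′-most site is an upstream one, so tags generated from the true site would have nothing to map to. A reference with no CATG at all returns `None` and cannot be assayed by an NlaIII-based tag protocol, whereas mRNA-Seq reads can still map to whatever portion of the gene is present.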
Each workflow identified a set of DE reference sequences that the other workflows did not detect (Illumina mRNA-Seq: 117, SOLiD SAGE: 95, Helicos DGE: 59). We found that different workflows generate different distributions of mapped reads across reference sequences, with SOLiD having fewer reads than the other platforms for genes with low expression. This could explain some of these residual differences in the ability to detect DE. Other possible sources of incongruence include sequence composition effects leading to lower or higher counts on a particular platform, for example those caused by random hexamer biases.
There are at least two important implications of the sensitivity to reference completeness that we identify here. First, as the completeness of gene reference sequences improves, differences between workflows in the ability to detect DE will decrease. Second, when only an incomplete reference sequence is available, mRNA-Seq outperforms tag-based workflows. It is important to note that the decision between tag-based and mRNA-Seq workflows is not a decision between sequencing platforms, as mRNA-Seq sample preparation protocols are available for Illumina, SOLiD, Helicos, and other platforms.
Sequence composition, completeness, and length are properties of the gene reference, and will therefore have the same impact on all samples that are mapped to that reference. However, these reference-specific properties will complicate intergene comparisons, including comparisons between different genes in the same species and orthologs in different species. These challenges apply to some of the most intuitively appealing investigations of the evolution of gene expression, such as the evolution of expression of a gene in a particular tissue across a phylogeny.
In addition to the tissues or treatments under consideration, gene expression is also a function of environmental factors and of the genotype of the sampled organisms. Because we collected three pairs of nectophore and gastrozooid samples from three specimens, we were able to take into account the impact of differences across samples as well as differences between tissues when assessing differential expression. These analyses indicate that expression was highly consistent across specimens, in agreement with the very low common dispersion in expression for this study. These results also indicate consistent mRNA harvest and high technical reproducibility for each sequencing workflow.
The hybrid design employed here, wherein long-read data are used to generate reference sequences and short-read data are used to quantify gene expression, provides a cost-effective strategy for analyzing differential gene expression in non-model organisms. With growing interest in comparative and ecological functional genomics, such studies will be increasingly common.