Recent advances in massively parallel cDNA sequencing (RNA-Seq) have opened the way for comprehensive analysis of any transcriptome [1]. In principle, RNA-Seq allows us to study all expressed transcripts, with three key goals: first, annotating the structures of all transcribed genes, including their 5’ and 3’ ends and all splice junctions [2–4]; second, quantifying the level of expression of each transcript [5,6]; and third, measuring the level of alternative splicing [7–11].
Standard libraries for RNA-Seq do not preserve information about which strand was originally transcribed: synthesis of randomly primed double-stranded cDNA, followed by addition of adaptors for next-generation sequencing, erases the orientation of the original mRNA template. In some cases, strand information can be inferred by subsequent computational analysis using, for example, open reading frame (ORF) information in protein-coding genes, biases in coverage between 5’ and 3’ ends [4], or splice site orientation in eukaryotic genomes [4,10,11].
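To illustrate how splice site orientation can reveal the transcribed strand, the following minimal sketch classifies an intron by its terminal dinucleotides. It assumes canonical GT-AG introns only, a reference sequence given on the plus strand, and 0-based half-open coordinates; the function name `infer_strand_from_intron` is hypothetical and is not taken from any published pipeline.

```python
def infer_strand_from_intron(genome: str, start: int, end: int) -> str:
    """Guess the transcribed strand from the intron's terminal dinucleotides.

    genome : reference sequence, written on the plus strand
    start, end : intron coordinates (0-based, half-open), i.e. genome[start:end]

    Returns "+" for a canonical GT...AG intron, "-" for its reverse
    complement (CT...AC on the plus strand), and "." when neither fits.
    Assumption: only canonical GT-AG introns are considered.
    """
    intron = genome[start:end].upper()
    if len(intron) < 4:
        return "."  # too short to carry both dinucleotides
    donor, acceptor = intron[:2], intron[-2:]
    if (donor, acceptor) == ("GT", "AG"):
        return "+"
    if (donor, acceptor) == ("CT", "AC"):
        return "-"
    return "."


if __name__ == "__main__":
    # Toy sequences, not real yeast loci.
    print(infer_strand_from_intron("AAAGTCCCCAGTTT", 3, 11))
    print(infer_strand_from_intron("AAACTGGGGACTTT", 3, 11))
```

Such inference is available only for reads that span a splice junction, which is why it offers little help in genomes with few or no introns.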
Nevertheless, direct information on the originating strand can substantially enhance the value of an RNA-Seq experiment. For example, such information would help to accurately identify antisense transcripts with potential regulatory roles [12], determine the transcribed strand of other non-coding RNAs, demarcate the exact boundaries of adjacent genes transcribed on opposite strands, and resolve the correct expression levels of overlapping coding or non-coding transcripts. These tasks are particularly challenging in small microbial genomes, both prokaryotic and eukaryotic, where genes are densely packed, with overlapping untranslated regions (UTRs) or ORFs, and where splice site information is limited or non-existent.
A host of methods has recently been developed for strand-specific RNA-Seq; they fall into two main classes. One class relies on attaching different adaptors in a known orientation relative to the 5’ and 3’ ends of the RNA transcript. These protocols generate a cDNA library flanked by two distinct adaptor sequences, marking the 5’ end and the 3’ end of the original mRNA, respectively. A second class relies on marking one strand by chemical modification, either on the RNA itself by bisulfite treatment or during second-strand cDNA synthesis, followed by degradation of the unmarked strand. Both modification methods essentially follow the standard RNA-Seq protocol, with the exception of these marking steps.
While standard RNA-Seq largely relies on one protocol, the great diversity of published protocols for strand-specific RNA-Seq poses challenges. First, when planning an experiment, researchers must identify a suitable protocol. Second, if protocols vary considerably in their performance, the chosen method can dramatically affect the conclusions drawn from an experiment, confounding interpretation and comparison across studies. There is therefore a substantial need for a systematic evaluation of the performance of different protocols for strand-specific RNA-Seq.
Here, we present a comprehensive comparison of seven protocols for strand-specific RNA-Seq. Using S. cerevisiae polyA+ RNA, we built a compendium of libraries with these protocols and sequenced each of them to deep coverage on the Illumina platform. We developed a computational pipeline to assess each library’s quality according to library complexity, strand specificity, evenness and continuity of coverage, agreement with known genome annotation, and quantitative accuracy for expression profiling, and we also considered the ease of laboratory and computational manipulations. We identify the dUTP and Illumina RNA ligation methods as the leading protocols, with the dUTP method offering the added benefit of compatibility with paired-end sequencing.
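As a rough illustration of two of these criteria, the sketch below computes simple proxies from a list of aligned reads. The definitions are assumptions made for this example rather than the exact metrics of our pipeline: complexity is taken as the fraction of reads with a distinct start position and strand, and strand specificity is summarized by the fraction of reads within annotated genes that map antisense to the annotation. The function names and the tuple layouts are hypothetical, and genes are assumed not to overlap.

```python
from typing import Iterable, List, Tuple

Read = Tuple[str, int, str]        # (chromosome, start position, strand)
Gene = Tuple[str, int, int, str]   # (chromosome, start, end, strand), half-open


def library_complexity(reads: Iterable[Read]) -> float:
    """Fraction of reads with a distinct (chromosome, start, strand)."""
    reads = list(reads)
    if not reads:
        return 0.0
    return len(set(reads)) / len(reads)


def antisense_fraction(reads: Iterable[Read], genes: List[Gene]) -> float:
    """Among reads starting inside an annotated gene, the fraction on the
    strand opposite to the annotation. Lower values suggest better strand
    specificity. Assumption: genes do not overlap, so the first match is used.
    """
    sense = antisense = 0
    for chrom, pos, strand in reads:
        for g_chrom, g_start, g_end, g_strand in genes:
            if chrom == g_chrom and g_start <= pos < g_end:
                if strand == g_strand:
                    sense += 1
                else:
                    antisense += 1
                break  # each read is counted against one gene at most
    total = sense + antisense
    return antisense / total if total else 0.0


if __name__ == "__main__":
    # Toy data for illustration only.
    reads = [("chrI", 10, "+"), ("chrI", 10, "+"), ("chrI", 20, "+"), ("chrI", 30, "-")]
    genes = [("chrI", 0, 100, "+")]
    print(library_complexity(reads), antisense_fraction(reads, genes))
```

The linear scan over genes keeps the example short; an interval index would be preferable for a genome-scale annotation.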