|Home | About | Journals | Submit | Contact Us | Français|
RNA-seq technologies are now replacing microarrays for profiling gene expression. Here we describe a robust RNA-seq strategy for multiplex analysis of RNA samples based on deep sequencing. First, an oligo-dT linked to an adaptor sequence is used to prime cDNA synthesis. Upon solid phase selection, second strand synthesis is initiated using a random primer linked to another adaptor sequence. Finally, the library is released from the beads and amplified using a bar-coded primer together with a common primer. This method, referred to as Multiplex Analysis of PolyA-linked Sequences (MAPS), preserves strand information, permits rapid identification of potentially new polyadenylation sites, and profiles gene expression in a highly cost effective manner. We have applied this technology to determine the transcriptome response to knockdown of the RNA binding protein TLS, and compared the result to current microarray technology, demonstrating the ability of MAPS to robustly detect regulated gene expression.
RNA-seq has the obvious advantage over microarray in profiling gene expression, in that it generates digital information of individual annotated genes with literally unlimited dynamic range. In addition, RNA-seq has the ability to comprehensively detect novel transcripts and mRNA variants resulting from alternative promoter usages, splice sites, and polyadenylation.
Many RNA-seq protocols have been developed in recent years (for review, see [1, 2]). Standard RNA-seq procedure uses fragmented RNA to prepare dsDNA followed by adaptor ligation . This method minimizes a bias towards the 3′ end of genes by converting poly(A+) RNA to cDNA. However, this approach provides no strand information, which is crucial for detection of anti-sense transcripts from genes within genes or transcription within overlapping genic regions in opposite directions. To overcome these shortcomings, various strategies have been developed to preserve the strand information, including (1) the use of different adaptors at the 5′ and 3′ ends [4, 5], (2) 3′ end polyA tailing , (3) double-random priming with distinct adaptors [7, 8], and (4) dUTP marking of the 2nd strand followed by selective degradation of the strand after linker ligation in order to sequence only the 1st strand .
While these methods have the ability to detect structural variations in mRNA, the counts generated depend on both abundance and length of individual transcripts. As a result, rare but long transcripts are more easily quantified than rare but short transcripts; the latter would require a much higher overall tag density to detect. For the purpose of gene expression profiling, the alternative approach focuses on the 3′ end of each gene which, like the original SAGE technology, is referred to 3′-tag digital gene expression [10, 11]. Key steps of the method include cleavage of dsDNA with a frequent restriction enzyme, adaptor ligation, and affinity purification of biotinylated oligo-dT initially used to prime cDNA synthesis. It has been demonstrated that this approach is more robust in detecting low abundant mRNAs . Another obvious advantage of this method is its ability to systematically identify polyadenylation sites, which has been applied to C. Elegans .
Here we describe a much simpler version of the 3′-tag digital gene expression approach, which we refer to as Multiplex Analysis of PolyA-linked Sequences (MAPS). This method, which is modified from our original double-random priming strategy , uses biotinylated oligo-dT directly linked to a specific sequencing adaptor to prime cDNA synthesis. This is followed by second strand synthesis using a random primer attached to a second adaptor. Using this technology, we detected >10,000 previously unannotated polyadenylation sites in HeLa cells and characterized the transcriptional response to knockdown of the Pol II-associated RNA binding protein, TLS in comparison with microarray. Our analysis has demonstrated the robustness of MAPS in studying regulated gene expression.
Our strategy, depicted in Fig. 1, employs an oligo-dT linked to a specific sequencing primer (B) corresponding to the P7 sequence anchored on the surface of Illumina flowcells to prime cDNA synthesis. The 3′ end of the primer contains two random nucleotides in order to ensure cDNA synthesis from the beginning of the poly(A) tail. The primer also carries a biotin moiety for solid phase selection. Excess oligo-dT is removed after cDNA synthesis and the 3′ end of cDNA is blocked by using terminal transferase in the presence of ddNTP. Following biotin selection, second strand synthesis is performed directly on magnetic beads using a random primer (8X) linked to a specific sequencing adaptor. After washing, the second strand is released from the beads by heat and eluted products are amplified using Primer B in combination with a bar-coded primer. The bar-coded primer consists of the Illumina sequencing primer (A), which enables the amplified sequence to anneal to the P5 primer on the Illumina flowcell, followed by a 4-nucleotide coded region (address), and the sequencing adaptor. The PCR products, while approximately within the same size region (see below), are size selected from acrylamide gel to minimize short fragments, quantified, and loaded onto an Illumina flowcell for deep sequencing.
Two rounds of sequencing are performed, starting with the insert, using the sequencing adaptor for 32 to 34 cycles of sequencing. The products are next stripped off the surface of the flowcell and a second round of sequencing is performed using primer A to identify the bar-coded region. In our hands, both sequencing reactions can be accomplished with the standard 36 cycle kit (because of extra sequencing reagents included in the kit). The sequences of bar-coded regions allow the assignment of individual samples in a multiplex run. We have successfully decoded >99% of the sequences using this method.
We build in two key features in the MAPS strategy. First, second strand synthesis is initiated by simple random priming. This method is much simpler than dsDNA synthesis followed by restriction digestion and linker ligation, which is currently used in the 3′-tag digital gene expression protocol. Although in theory random priming can take place anywhere on the first strand cDNA, the following PCR reaction ensures biased amplification of short fragments from the 3′ end of expressed genes.
The second feature of the MAPS strategy is obviously the ease of multiplexing samples using individual bar-coded primers. Our strategy is distinct from standard protocols using bar-coded adaptors during library construction because it ensures all library construction is performed under uniform conditions without the concern of differences in quality among different bar-coded adaptors. In our applications, we routinely multiplex 6 reactions, typically 3 biological repeats of control and 3 treated samples in one lane of the Illumina GAII Analyzer. Multiplexing 6 reactions in a single lane permits the use of simple statistical analysis between differentially expressed genes. This level of multiplexing also achieves the economy of genome-wide analysis, as the cost for transcriptome analysis is ~$100 per sample, which is comparable to current microarray platforms. In future applications using HiSeq2000, it should be possible to multiplex 18 or more samples per lane, thus dramatically reducing the cost of gene expression profiling to a mere $35 per sample.
We first applied the MAPS strategy to total RNA isolated from HeLa cells, demonstrating the high reproducibility of this approach (Fig. 2A). From three biological repeats, we obtained ~15M tags, among which ~10M were uniquely mapped to the human genome (the control raw in Table 1). We removed potential internal priming events , resulting in ~8.3M tags, which were used for further analysis. As expected, the tags were clustered at the 3′ end of genes, ~300nt upstream of the polyA site (Fig. 2B). The two gene examples provided clearly indicate that MAPS is able to detect both unique and alternative polyadenylation sites (Fig. 2C and 2D).
We next clustered the tags based on annotated polyA sites, resulting in 8,859 clusters (Table 2, control). By scanning against 13 polyA signal variant motifs , we found that 7547 (85.2%) clusters contain a known polyA signal within the cluster, with remaining 1312 (14.7%) clusters lacking a canonical polyA signal. This rate is similar to that found in C. elegans . Strikingly, we also detected 12,532 potential new polyA sites in the human genome, ~80% of which are also associated with a polyA signal (Table 2). This observation suggests that there are a huge number of polyA sites, either within known genes or associated with new transcripts, that remain to be annotated in the human genome. Therefore, our method can be used not only to detect those signals, but also analyze differential polyadenylation in future studies.
We next applied the MAPS strategy to profile gene expression in response to knockdown of FUS/TLS in HeLa cells. FUS/TLS is an RNA binding protein originally identified as a frequent chromosome translocation partner in liposarcomas [14–19]. Most recently, FUS/TLS was linked to Amyotrophic Lateral Sclerosis (ALS) [27, 28]. We used RNAi to achieve efficient knockdown of FUS/TLS in HeLa cells (Fig. 3A). Both control and specific RNAi-treated samples were subjected to library preparation according to MAPS followed by deep sequencing in a single lane on an Illumina flowcell. After filtering tags likely resulting from internal priming, we obtained ~8M uniquely mapped tags under FUS/TLS knockdown conditions. This result is comparable to those from control siRNA-treated cells (Table 1). Similar to control cells, FUS/TLS knockdown samples were highly reproducible when comparing between biological repeats (data not shown), and again, more than half of tag clusters were mapped outside known polyA sites (Table 2). By comparing between control and specific RNAi-treated cells, we identified several hundred genes that were either up- or down-regulated in response to in vivo depletion of FUS/TLS (Fig. 3B).
In order to validate the data generated by MAPS on the global scale, we performed microarray analysis on the same sets of samples from control and FUS/TLS RNAi-treated cells (Fig. 4A). We observed a similar pattern to previously documented comparisons of RNA-seq and microarray platforms . As noted before, microarray intensity appears larger than RNA-seq counts, likely due to elevated background on the microarray. We next compared differentially regulated genes, demanding FDR=<0.05 and fold change (FC)=>2 (Fig. 4B). While the two platforms generally agree with one another in terms of detecting similar trend changes for both up- and down-regulated genes, it is clear that a large number of genes that scored quantitative differences failed to show differences on the microarray platform (green dots). As a result, the RNA-seq data revealed more quantitative differences than the microarray (Fig. 4C). We also noted that, although there were a number of genes (104) that were only identified as differentially expressed according to the microarray (blue dots), those genes tended to show differences slightly above the fold cut-off (Fig. 4B). Previous analyses show similar patterns of agreement and discrepancy between microarray and RNA-seq platforms . We have also tested the reproducibility and multiplex capacity of MAPS technology. To do this, we compared the same samples and multiplexed three or six samples at once in a lane (Supplementary Fig. 1). The number of reads per gene is slightly reduced when multiplexing at a higher level, but importantly, the relative amount of gene is virtually identical. The ability of MAPS to detect more differentially expressed genes than microarray suggests that the digital method is more suitable for quantitative analysis of regulated gene expression. Further, the ability of MAPS to effectively multiplex with six samples in a highly reproducible manner illustrates the power and monetary benefit of this technology.
We describe a multiplex strategy for profiling gene expression. The procedure is much simpler than existing 3′-tag digital gene expression analyses and is highly cost effective compared to microarray-based approaches. The MAPS strategy provides strand information, thus permitting accurate assignment of sequence tags from the transcription of both sense and antisense strands within many gene loci. In addition to quantitative analysis of gene expression, our method can also be applied to detect new polyadenylation sites. We suggest a large of number of these sites identified have yet to be annotated. Although, some of these new polyA-containing transcripts correspond to non-coding RNAs expressed at low levels, their biological importance is appreciated. We have applied this technology to study regulated gene expression by FUS/TLS, a RNA binding protein implicated in diverse biological processes, illustrating that this method can be applied to study a multitude of different specific conditions. MAPS is a simple technique that enables high quality, accurate and digital gene expression data at an economical cost.
HeLa cells cultured in DMEM plus 10% FBS were transfected using Lipofectamine 2000 (Invitrogen) with 50pmol/ml of either TLS/FUS siRNA 5′-ACAGCCCATGATTAATTTGTA-3′ or control siRNA for 64 hours. The cells were subsequently harvested using Trizol followed by an Ambion RNA easy clean up. The resulting RNA was split into two samples, one for microarray analysis and the other for RNA-seq.
Total RNA (1–3ug) was added to 1ul of 50uM Bio-B-TNN (Biotin-5′-CAAGCAGAAGACGGCATACGAGTTTTTTTTTTTTTTTTTTTTNV-3′, note that the sequence at the 5′ portion corresponds to Primer B or P7 on the Illumina flowcell and that V in the last position represents A, G, or C, but not T), 1ul of 10mM dNTPs and distilled water to a total of 13ul. The mixture was heated for 5 minutes at 65°C, and subsequently incubated on ice for at least 1 minute. After the samples were cooled, 4ul of 5X First-strand buffer (Invitrogen), 1ul of 0.1M DTT, 1ul RNase Inhibitor (Invitrogen), and 1ul Superscript III RT (Invitrogen) were added to the reaction. The reaction was incubated at 50°C for 60 minutes. To inactivate the reaction, 80ul of water was added and the sample was heated at 70°C for 15minutes. To concentrate and remove free biotinylated oligos, the reaction along with 500ul of Qiagen PCR purification buffer (Qiagen PCR Purification Kit Cat # 28104) was added to the column. After washing the Qiagen column once with the binding buffer and twice with wash buffer, the cDNA was then eluted with 50ul of Qiagen elution buffer (Tris-HCl pH 8.5).
The eluted 50ul sample was added to a mixture of 15ul of 10X terminal transferase buffer (NEB cat #M0315S), 3ul of 130uM ddNTP mix, 80 ul of distilled water, and 2ul of terminal transferase (NEB cat #M0315S) to block the first strand. The reaction was incubated at 37°C for 1hr. EDTA was added to the final concentration of 20mM to stop the reaction. To the resulting primer extension product, 5ul Streptavidin-coated magnetic beads was added (Invitrogen (Dynabeads My One Streptavidin C1, Cat #65001), and the sample incubated at room temperature for 20 minutes. The collected beads were washed first with 100ul NaOH (0.1M) by pipetting up and down, and incubated for 5 minutes at room temperature. The beads were washed with water twice, then 1ul of sequencing Primer A-Random (100uM) (5′-GCTGATGCTACGACCACAGGNNNNNNN-3′, note that the specific sequence at the 5′ portion corresponds to the primer for sequencing on the Illumina flowcell) was added to the beads along with 5ul of 10X standard Taq DNA polymerase buffer (NEB), 1ul 10mM dNTPs, and water up to 49ul. The reaction was initiated by adding 1ul of Taq DNA polymerase (5ul, NEB) and incubated at 25°C for 1hr, 72°C for 30 seconds to extend, and then 75°C for 5 minutes. EDTA was added up to a final concentration of 10mM to stop the reaction. The beads were washed twice with 150ul of wash buffer (10mM Tris, pH 7.5, 1mM EDTA, 0.1% Tween-80 or Triton-X 100) and re-suspended in 20ul water and heated for 5 minutes at 95°C to release the products extended from the random primer.
The released DNA was amplified for 18 cycles using a common primer (Primer B: 5′-CAAGCAGAAGACGGCATACGAG-3′) in combination with a bar-coded PCR primer (5′-AATGATACGGCGACCACCGAGA-TXXXX-GCTGATGCTACGACCACAGG-3′, note that the specific sequence at the 5′ portion corresponds to Primer A or P5 on the Illlumon flowcell and that the specific sequence after the bar-code region corresponds to the sequencing primer).
The PCR products were precipitated with EtOH in the presence of glycogen (1:100), and resuspended 10ul water. The PCR products were fractionated on a 10% acrylamide gel. DNA fragments in the 150–350nt range were cut out, eluted with the elution buffer (10mM Tris-pH7.5, 1mM EDTA, 0.1% Triton X-100), and subjected to sequencing on the Illumina GAII Gene Analyzer.
The analysis was done by using in-house made scripts, which used bx-python (http://bx-python.trac.bx.psu.edu), kent source , and BEDTools . Fastq files were aligned with Bowtie  to merged transcript sequences from the UCSC Known Gene Database [33, 34]. The parameter for the mapping program Bowtie was “--solexa1.3-quals -l25 -n2 -e 200 -m1 --best --strata --trim5 4” . The mapping started at the 5th position of each tag to increase higher rate and up to 2 mismatches were allowed. The unmapped tags in the first round were then mapped to the genome (hg18). The internal priming was checked for each tag in its 300nt downstream sequence to see whether there was at least one polyA stretch (consecutive 8 As or 9 As in a 10nt window) . Differential expression between TLS knock-down and control was assessed using the edgeR package in R .
Total RNA was isolated from control and TLS knockdown samples in triplicate and applied to Illumina BeadChip (HumanHT-12 v4) containing ~48K probes. Array hybridization was performed at the UCSD Core facility according to manufacturer’s protocol. Quantile normalization was performed in R using the lumi package . Differential expression was measured in R using the limma package . A False Discovery Rate of 0.05 was applied to the resulting p-values to correct for multiple-hypothesis testing.
Data from RNA-seq and microarrays were merged on the Gene Symbol. For comparison of control gene expression, 1 was added to the RNA-seq counts before taking the log in order to avoid log of 0. For comparison of differentially expressed genes, we limited the analysis to genes that had at least one tag in one replicate under each condition.
We thank members of the Fu lab for technical help and stimulating discussion during the course of this study. We acknowledge early contribution of Ravi Dinakar to this project. This work was supported by a training grant (PI: Dan Donoghue) from NCI to KFW (T32 CA009523) and NIH grants (GM049369 and HG004659) to XDF.
Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.