R-SAP is a bioinformatics tool for the processing and analysis of high-throughput RNA-Seq data that integrates reference genome alignments of sequencing reads with known transcript models.
Using three publicly available datasets (MAQC, ENCODE and ChimerDB 2.0) to evaluate different modules of the pipeline, we have shown that R-SAP can systematically detect novel transcriptional events, including various classes of RNA isoforms and other transcript structures such as intra-genic and inter-genic chimeras. R-SAP's performance in categorizing transcripts represents a significant improvement over currently available pipelines, as exemplified by Trans-ABySS and Cufflinks/Cuffcompare. Moreover, R-SAP's RNA expression level estimates are highly correlated with independent gene-expression microarray analyses and experimentally derived qRT–PCR measurements. Currently, R-SAP simply excludes multi-hit reads from further analysis because they cannot be assigned to unique genomic loci. We expect a significant improvement in R-SAP's expression estimates once bias-correction and multi-hit read re-distribution methods are included in future releases of R-SAP.
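The exclusion of multi-hit reads described above can be illustrated with a minimal sketch. This is not R-SAP's actual code: the function name `drop_multi_hit_reads` and the `(read_id, chromosome, start)` tuple layout are assumptions made only for illustration.

```python
from collections import Counter


def drop_multi_hit_reads(alignments):
    """Keep only alignments whose read maps to exactly one genomic locus.

    `alignments` is an iterable of (read_id, chromosome, start) tuples;
    this record layout is hypothetical and chosen for illustration.
    """
    alignments = list(alignments)  # materialize so we can iterate twice
    hits_per_read = Counter(read_id for read_id, _, _ in alignments)
    # A read with more than one reported alignment is a multi-hit read
    # and is discarded entirely rather than assigned to one of its loci.
    return [aln for aln in alignments if hits_per_read[aln[0]] == 1]


example = [
    ("read1", "chr1", 100),
    ("read2", "chr1", 200),  # read2 aligns to two loci
    ("read2", "chr5", 900),
    ("read3", "chrX", 50),
]
unique = drop_multi_hit_reads(example)
```

In this sketch both alignments of `read2` are removed, so the read contributes nothing to downstream expression estimates. A re-distribution method would instead apportion such reads among their candidate loci.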
R-SAP's ability to accurately detect alternative splicing and chimeric transcripts is optimal for sequencing reads >40–50 bp. We do not consider this to be a significant shortcoming given that most current and envisioned sequencing methodologies do or soon will generate read lengths well above this threshold (47). R-SAP's characterization of sequencing reads also depends on the choice of the reference transcript set. In our test analyses, we conservatively used RefSeq transcripts as our reference set. We believe that characterization can be further improved by using a more informative, non-redundant and inclusive set of all established transcript models such as UCSC, Ensembl, RefSeq and AceView (18
One of our major goals in constructing R-SAP was to develop a pipeline that can be fine-tuned according to the nature of the data. We sought to achieve this goal by incorporating various user-adjustable cutoffs in the workflow that can be used to alter the stringency of each analysis. For example, when the reference genome or the sequencing reads are of lower quality, a high rate of mismatches and small gaps can be compensated for by lowering the coverage, identity and/or deletion cutoff values. Similarly, for poorly annotated exon boundaries where alignments may extend slightly beyond the edge of the exon, the exon-extension cutoff can be increased to accommodate alignment errors at exon boundaries.
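The effect of such cutoffs can be sketched as follows. This is a simplified illustration rather than R-SAP's implementation: the class names, field names and default values are hypothetical, coverage and identity are computed under the assumed definitions given in the comments, and the deletion cutoff is omitted for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cutoffs:
    """Hypothetical cutoff set; names and defaults are illustrative only."""
    min_identity: float = 95.0  # percent of aligned bases matching the reference
    min_coverage: float = 95.0  # percent of the read included in the alignment
    exon_extension: int = 0     # bases an alignment may overhang an exon edge


@dataclass(frozen=True)
class Alignment:
    read_length: int    # total read length in bases
    aligned_bases: int  # read bases included in the alignment
    matches: int        # aligned bases identical to the reference


def passes_quality(aln: Alignment, cut: Cutoffs) -> bool:
    """Accept an alignment only if it meets both coverage and identity cutoffs."""
    if aln.read_length <= 0 or aln.aligned_bases <= 0:
        return False
    coverage = 100.0 * aln.aligned_bases / aln.read_length
    identity = 100.0 * aln.matches / aln.aligned_bases
    return coverage >= cut.min_coverage and identity >= cut.min_identity


def fits_exon(block_start: int, block_end: int,
              exon_start: int, exon_end: int, cut: Cutoffs) -> bool:
    """Treat an aligned block as exonic if it overhangs the annotated
    exon by no more than `exon_extension` bases on either side."""
    return (block_start >= exon_start - cut.exon_extension
            and block_end <= exon_end + cut.exon_extension)


strict = Cutoffs()
relaxed = Cutoffs(min_identity=90.0, min_coverage=90.0, exon_extension=2)
aln = Alignment(read_length=100, aligned_bases=92, matches=88)
```

Here the same alignment (92% coverage, roughly 95.7% identity) is rejected under the strict settings and accepted under the relaxed ones. Likewise, a block starting 2 bp upstream of an annotated exon is treated as exonic only when the exon-extension cutoff is raised to 2.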
The characterization of transcriptomes using RNA-Seq is a multi-faceted problem that includes cataloguing of coding and non-coding transcripts, uncovering and characterization of novel RNA isoforms and chimeric transcripts, detection of new splice sites, discovery of new transcriptional structures, measurement of RNA expression levels and estimation of RNA isoform-specific expression levels (11). We hope that R-SAP will prove useful as a user-friendly bioinformatics tool to complement more specialized programs in the quantitative and qualitative analysis of RNA-Seq data.