Summary: RNA-seq, the application of next-generation sequencing to RNA, provides transcriptome-wide characterization of cellular activity. Assessment of sequencing performance and library quality is critical to the interpretation of RNA-seq data, yet few tools exist to address this issue. We introduce RNA-SeQC, a program which provides key measures of data quality. These metrics include yield, alignment and duplication rates; GC bias, rRNA content, regions of alignment (exon, intron and intragenic), continuity of coverage, 3′/5′ bias and count of detectable transcripts, among others. The software provides multi-sample evaluation of library construction protocols, input materials and other experimental parameters. The modularity of the software enables pipeline integration and the routine monitoring of key measures of data quality such as the number of alignable reads, duplication rates and rRNA contamination. RNA-SeQC allows investigators to make informed decisions about sample inclusion in downstream analysis. In summary, RNA-SeQC provides quality control measures critical to experiment design, process optimization and downstream computational analysis.
Supplementary information: Supplementary data are available at Bioinformatics online.
RNA-seq is a highly parallelized sequencing technology that allows for comprehensive transcriptome characterization and quantification (Wang et al., 2009). As with all forms of parallelized sequencing, significant computational processing is required to unlock transcript abundance levels and other measures for biological interpretation (Garber et al., 2011). However, prior to the calculation of biologically relevant data such as transcript abundance, presence of novel isoforms and genotype identity, it is necessary to evaluate the performance of the RNA-seq experiment itself. Summary statistics and quality control scores provide insight into inherently complex data prior to downstream analysis.
Here we present RNA-SeQC, a metrics tool with application to two domains: experiment design and process optimization; and quality control prior to computational analysis. Metrics such as duplication rate, rRNA abundance, alignment rates, coverage continuity and correlation to reference expression profiles are highly informative during selection of experiment conditions and library construction methods (Levin et al., 2010). RNA-SeQC's multi-sample input feature allows for direct comparison across samples (Fig. 1). Additionally, a single-sample mode can be used to monitor samples on an ongoing basis to rapidly assess the quality of a particular sequencing run, and to monitor and optimize these measures in production over time and prior to downstream analysis. RNA-SeQC provides a suite of experiment quality measures, many of which are currently not provided by other available tools (Supplementary Material).
RNA-SeQC provides three types of quality control metrics: Read Counts, Coverage and Correlation. A list and description of these metrics is shown below. RNA-SeQC is compatible with any alignment method that produces a specification-conforming BAM file (Li et al., 2009), with flags properly set. For additional information, usage and software requirements, see the GenePattern help document provided as Supplementary Material 1. Metrics reports are provided in HTML for human consumption, as well as tab-delimited text files for pipeline integration.
The following metrics are generated by counting reads with particular characteristics. Rates are also provided, and are calculated either per total reads or per aligned reads. Because the BAM format supports multiple alignments per read, this implementation ignores any read flagged as not being a primary alignment, so that each read is counted only once.
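The counting rule described above can be illustrated with a minimal sketch. It is not RNA-SeQC's own code (which is written in Java on top of the GATK); it assumes the integer FLAG field of each alignment record has already been extracted, and the function name `read_count_metrics` and the choice of denominators are illustrative assumptions. The flag bit values themselves come from the SAM/BAM specification.

```python
# SAM flag bits as defined by the SAM/BAM specification.
FLAG_UNMAPPED = 0x4
FLAG_SECONDARY = 0x100   # "not primary alignment"
FLAG_DUPLICATE = 0x400   # PCR or optical duplicate


def read_count_metrics(flags):
    """Summarise read counts from an iterable of integer SAM flags.

    Hypothetical helper: the denominators used for the two rates are
    assumptions chosen to show "per total reads" and "per aligned reads".
    """
    total = mapped = duplicates = 0
    for flag in flags:
        if flag & FLAG_SECONDARY:
            continue  # skip non-primary alignments so each read counts once
        total += 1
        if not flag & FLAG_UNMAPPED:
            mapped += 1
            if flag & FLAG_DUPLICATE:
                duplicates += 1
    return {
        "total": total,
        "mapped": mapped,
        "duplicates": duplicates,
        "mapping_rate": mapped / total if total else 0.0,           # per total reads
        "duplication_rate": duplicates / mapped if mapped else 0.0,  # per aligned reads
    }
```

Skipping secondary records before any counter is incremented keeps multiply-aligned reads from inflating both the numerators and the denominators.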
The following metrics are based on coverage: the number of reads that cover a given genomic position (in units of reads per base). RNA-SeQC quantifies the uniformity of coverage with several different metrics. To reflect the effect of expression level on these metrics, we select genes from three categories: low, middle and high expression genes (see Supplementary Material) and also report the average of these metrics for each gene set.
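To make the notion of coverage uniformity concrete, the sketch below computes per-base depth over one region and summarises it with the mean, the coefficient of variation and the number of zero-coverage gaps. These three summaries are chosen for illustration and are not claimed to match RNA-SeQC's exact metric definitions. The function names, the half-open `(start, end)` read intervals and the non-empty region are assumptions.

```python
from statistics import mean, pstdev


def per_base_depth(start, end, reads):
    """Depth at each position of [start, end) from half-open read intervals."""
    depth = [0] * (end - start)
    for r_start, r_end in reads:
        # Clip each read to the region before counting its bases.
        for pos in range(max(r_start, start), min(r_end, end)):
            depth[pos - start] += 1
    return depth


def coverage_summary(depth):
    """Mean depth, coefficient of variation and number of zero-depth gaps."""
    mu = mean(depth)
    cv = pstdev(depth) / mu if mu else float("nan")
    gaps = 0
    in_gap = False
    for d in depth:
        if d == 0 and not in_gap:
            gaps += 1  # a new run of uncovered bases starts here
        in_gap = d == 0
    return {"mean": mu, "cv": cv, "gaps": gaps}
```

A lower coefficient of variation indicates more even coverage along the region. Computing the summary separately for low, middle and high expression genes, as the text describes, amounts to calling `coverage_summary` per gene and averaging within each set.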
One of the most valuable ways to interpret the performance of an RNA-seq run is to compare the measured expression levels to a reference (Levin et al., 2010). RNA-SeQC provides RPKM-based estimation of expression levels (Mortazavi et al., 2008). When run with multiple samples, RNA-SeQC creates a matrix of correlations among all pairs of samples, reporting the Spearman (rank-based) and Pearson (quantity-based) correlation coefficients. Optionally, an array-based or RNA-seq reference expression profile can be provided for the correlation analysis. Correlation metrics are also provided for the different GC content stratifications to measure GC bias.
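The two ingredients of this comparison, RPKM and the pairwise correlation coefficients, can be sketched as follows. RPKM is computed with the standard definition (reads per kilobase of transcript per million mapped reads). The function names are illustrative, and the sketch assumes equal-length expression vectors with non-zero variance. It is not RNA-SeQC's implementation.

```python
from math import sqrt


def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return 1e9 * read_count / (transcript_length_bp * total_mapped_reads)


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1  # extend the run of tied values
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))
```

Because Spearman depends only on rank order, it is insensitive to monotonic differences in scale between platforms (for example, array intensities versus RPKM), whereas Pearson reflects agreement in the quantities themselves.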
Implemented in Java, RNA-SeQC is platform independent and requires no installation. For investigators who prefer a web interface to a command-line tool, this software can be run using the GenePattern web interface found at http://www.GenePattern.org (Reich et al., 2006).
Within the RNA-SeQC software package, Read Count metrics were implemented by inheriting from the ReadWalker class of the GATK software package (McKenna et al., 2010). Transcript annotations are bound to the walker in the RefGen format. This format is created on-the-fly from a user-provided GTF file. The program is designed to support the minimal GTF specification, but the GTF format used by GENCODE (Harrow et al., 2006) is recommended. For continuity of coverage calculations, the GATK's Depth of Coverage walker was used to calculate the number of bases at a given position in the genomic alignment. Finally, ribosomal RNA quantification is performed by realigning all reads to rRNA reference sequences using the Burrows–Wheeler Aligner (Li and Durbin, 2009).
Funding: Funded in part with Federal funds from the National Human Genome Research Institute, National Institutes of Health, Department of Health and Human Services, under Contract No. HHSN268201000029C.
Conflict of Interest: none declared.