The easyRNASeq package combines the following steps: reading in sequenced reads, retrieving annotations, summarizing read counts by the feature of interest, e.g. exon, gene and finally reporting results, normalized or not, in formats suitable for downstream analyses. This is achieved by using and extending many Bioconductor packages functionalities () and provided to end users as a single function that wraps the entire process.
Packages wrapped by easyRNASeq. At every step of the process, easyRNASeq encapsulates and extends lower level package functionalities, finally merging them into a single high-level function.: easyRNASeq
2.1 Reading data
Depending on the alignment format, manufacturer-specific (e.g. Illumina export) or the de facto BAM (Binary Alignment/Map) standard, the data are parsed by either ShortRead or Rsamtools, respectively. The coverage is extracted per base pair and divided by the read length—i.e. reads’ coverage proportion are reported per base pair—and stored in an ‘IRanges running length encoding’ (RLE) vector. This approach yields identical results to the common counting reads per se approach, when applied to non-spliced regions as shown in the ‘RNASeqTutorial’ vignette. However, it more accurately assigns reads spanning exon–exon junctions (EEJ), i.e. unlike common methods that arbitrarily select an EEJ side, reads coverage proportion is, here, distributed across the EEJ.
2.2 Loading annotations
Genic annotations are retrieved using ‘biomaRt’ (Durinck et al., 2005
) or read from ‘General Feature Format 3’ (GFF3) or ‘Gene Transfer Format’ (GTF) files using ‘genomeIntervals’. The annotation set is stored in a ‘RangedData’ object (IRanges). To reduce loading times, ‘RangedData’ or ‘GRangesList’ objects from the R environment or RData
(rda) files can be used, provided they describe exons, features, transcripts or genes (a ‘feature’ is the representation of a genomic locus, not necessarily genic, e.g. an enhancer).
2.3 Counting reads
The reads’ coverage is summarized according to the chosen features: exons, features, transcripts or gene models. Here, a gene model is defined as the set of non-overlapping loci (i.e. synthetic exons) that represent all the possible exons and untranslated regions of a gene. easyRNASeq is not limited to genic summarization only, e.g. promoter ‘features’ can be used to look for eRNAs (Kim et al., 2010
Four formats are offered: count table (the default), ‘CountDataSet’ (DESeq
object), ‘DGEList’ (edgeR
object) and ‘RNAseq’ (easyRNASeq
object). The count table reports raw counts or ‘reads per kb of feature per million reads’ (RPKM, Mortazavi et al., 2008
), if preferred. For DESeq
, an object of their respective class is returned. If desired, these would have been subjected to their respective normalization, in which case quality assessment (QA) plots are drawn to evaluate it. Finally, RNAseq
objects allow different count summarizations to be performed (e.g. by exon, by gene) on the same data without re-processing, a useful feature when first assessing a dataset.
Fetching annotation using ‘biomaRt’ takes about 10 min from an average network. Generating the gene models takes up to 15 min for large genomes—e.g. Homo sapiens. If these annotations are readily available, processing a 36-bp single-end ‘Illumina’ lane with 100 M reads—a BAM file of 3.2 GB—requires 3 GB RAM and 3 min on an Intel 2.4 GHz CPU.