2.1 MRF and converters
MRF only stores a minimal set of information, i.e. information that cannot be derived from the MRF data itself. This has the advantage of keeping the format succinct, while still capturing the relevant information for most analyses. MRF consists of three components: comment lines (optional) denoted by a leading ‘#’ sign, a header line and the mapped reads. The header line specifies the data type of each column: AlignmentBlocks, Sequences, QualityScores and QueryID. The column type AlignmentBlocks is required and represents the mapped reads. Each alignment block contains the coordinates with respect to the reference genome to which the read aligns as well as the read coordinates. A read spanning multiple regions, e.g. multiple exons, is denoted by multiple alignment blocks that are separated by a comma. Paired-end reads can be represented by using a set of alignment blocks for each end, which are separated by the ‘|’ symbol. By using this format, it is straightforward to specify both gapped and paired-end alignments. The RSEQtools package includes various utilities to convert the output of several mapping tools into MRF. A converter for the commonly used SAM format is included as well. The first example below represents two paired-end reads where one end is spliced, whereas the second example shows two un-spliced single-end reads with their associated QueryIDs:
The optional types Sequences, QualityScores and QueryID provide additional information. In particular, the confidentiality issues can be addressed by generating two files: one including the alignments and a second one containing the sequences such as a FASTQ file. The former is useful for most analyses and can be publicly shared because it does not contain confidential information, whereas the latter can be subjected to a higher level of security and control. The two files can be conjoined, if necessary, by using the common QueryID as shown in .
2.2 RNA-Seq analysis with RSEQtools
The RSEQtools suite contains a set of modules to perform a large variety of tasks including the quantification of expression values, manipulation of gene annotation sets, visualization of the mapped reads, generation of signal tracks, the identification of transcriptional active regions and several auxiliary utilities (Supplementary Table S1).
Genome annotation tools: to generate a splice junction library from any annotation set, we extract the genomic sequences of all the exons and synthetically create all splice junctions specified in the annotation set. This splice junction library can be used in combination with the reference sequences. A second tool is particularly useful when estimating expression levels. In order to capture the information of the various transcript isoforms, a ‘gene model’ is required. The module mergeTranscripts collapses the transcript isoforms into a single gene model by either taking the union or intersection of the exonic nucleotides.
Quantification of gene expression:
one of the key features of RNA-Seq is the quantification of expression at different levels. Hence, a key module calculates the gene expression values for a given annotation set and a collection of mapped reads in MRF format. The annotation set specifies which ‘elements’ will be quantified. The program mrfQuantifier
calculates RPKM (reads per kilobase per million mapped reads) values at the nucleotide level (Mortazavi et al., 2008
). Briefly, for a given entry in the annotation set (typically an exon or gene model), the number of nucleotides from all the reads that overlap with this annotation entry are added up and then this count is normalized by sequence length of the annotation entry (per killobase) and by the total number of mapped nucleotides (per million). This calculation is not performed at the transcript level, which requires a more sophisticated analysis (Guttman et al., 2010
; Trapnell et al., 2010
Visualization of mapped reads: the RSEQtools package also contains various tools for visualizing the results in genome browsers, by means of wiggle (WIG) and bedGraph files, which are commonly used to represent a signal track of mapped reads. Also, a GFF file can be generated from MRF files to visualize splice junction reads (example in ).
Identification of transcriptionally active regions (TARs)
: transcribed regions can be identified de novo
by performing a maxGap/minRun segmentation (Kampa et al., 2004
; Royce et al., 2005
) from the signal files using the wigSegmenter
program. Briefly, the signal is first thresholded to identify transcribed elements. Contiguous elements whose distance is less than ‘maxGap’ are joined together and then filtered if the final size is less than ‘minRun’. This type of analysis is particularly useful in discovering novel TARs such as small RNAs, etc.
MRF selection and auxiliary utilities: lastly, RSEQtools includes a set of utilities to easily manipulate MRF files and a collection of format conversion tools allowing for rapid pipeline development.
Implementation and run time: the modules of the RSEQtools suite were implemented in C and the code was optimized in order to efficiently handle large datasets. The importance of code scalability cannot be overemphasized in a time where datasets become increasingly large and easily exceed several gigabytes. For example, the conversion of an ELAND export file (uncompressed file size: ~4 GB; total number of reads: ~20 million; number of mapped reads: ~12 million) to MRF takes ~2 min and the resulting MRF file is significantly smaller (~400 MB uncompressed, ~130 MB compressed with gzip). Converting the same ELAND export file to SAM generates a file of ~3.1 GB (uncompressed) and the corresponding BAM file has a size of ~1.2 GB. The subsequent quantification of gene expression using mrfQuantifier requires 45 s to calculate estimates for about 20 000 genes.
In addition, the modularity of RSEQtools also enables the development of additional programs in any programming language and their seamless integration into this framework. Finally, most modules use STDIN and STDOUT to process the data, making them suitable to be integrated into an automated pipeline.