APART is a user-friendly bioinformatic tool for obtaining a full overview of the global transcriptome of a cell or an entire organism. Because its analysis is contig-based rather than restricted to profiling of known genes, APART can be used to identify novel stable RNA species, including both intergenic transcripts and RNA processing products. Other benefits over existing pipelines include efficient handling of non-unique reads, a novel measure of transcript abundance assigned to the contigs and putative processing products, and a convenient presentation of the results. These features are especially important for genome-wide ncRNA studies in higher eukaryotes, since recent studies have clearly revealed ncRNA transcriptomes to be far more complex than initially anticipated (
1,
42).
The proposed workflow for detection of stable RNA species consists of two steps. The first is the appropriate experimental preparation of a cDNA library enriched in functional stable RNAs. The enrichment methodology has already been well documented (
6–7,
43); thus, the major challenge was the development of a novel computational tool for analysis of the deep-sequencing data. The key feature of the APART pipeline is the detection of novel stable RNA molecules. Although the employed method is very simple, it follows the idea that stable RNA transcripts and processing products should be protected against endo- and exo-nucleases. We therefore introduced a strict requirement for exact 5′- and 3′-ends. This contrasts with the method used in the blockbuster algorithm (
44), where reads with non-identical ends are joined into ‘blocks’. The blockbuster approach was, however, initially developed to separate microRNA and microRNA* blocks of reads so that separate expression values could be assigned. The major difference between our approach and microRNA profiling experiments is that read data obtained from microRNA profiling experiments contain almost no background reads derived from precursor hairpins (owing to specific amplification of RNA processing products only). In our dataset the amount of background reads was in many cases substantial, so the read distribution analysis proposed in blockbuster failed to separate the putative processing products.
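The contrast between exact-end grouping and joining of overlapping reads can be illustrated with a minimal Python sketch. It is not the APART implementation, and the overlap merge is a deliberately simplified stand-in rather than the blockbuster algorithm itself. The function names, the `min_reads` cutoff and the toy coordinates are assumptions made for illustration only.

```python
from collections import Counter

def exact_end_groups(reads, min_reads=3):
    """Keep groups of reads that share chromosome, strand and both ends.

    `reads` holds (chrom, strand, start, end) tuples, 1-based inclusive.
    `min_reads` is a hypothetical cutoff, not an APART parameter.
    """
    counts = Counter(reads)
    return {key: n for key, n in counts.items() if n >= min_reads}

def overlap_blocks(reads):
    """Merge overlapping reads on the same chromosome and strand into blocks,
    ignoring whether their ends are identical (simplified illustration)."""
    blocks = []
    for chrom, strand, start, end in sorted(reads):
        last = blocks[-1] if blocks else None
        if last and last[0] == chrom and last[1] == strand and start <= last[3]:
            last[3] = max(last[3], end)  # extend the current block
            last[4] += 1                 # count the read
        else:
            blocks.append([chrom, strand, start, end, 1])
    return [tuple(b) for b in blocks]

# Four reads with identical ends on top of three background reads.
reads = [("chrI", "+", 100, 130)] * 4 + [
    ("chrI", "+", 95, 140),
    ("chrI", "+", 100, 128),
    ("chrI", "+", 110, 150),
]
print(exact_end_groups(reads))  # {('chrI', '+', 100, 130): 4}
print(overlap_blocks(reads))    # [('chrI', '+', 95, 150, 7)]
```

In this toy case the exact-end criterion isolates the product supported by four identical reads, whereas the overlap merge absorbs it into a single block together with the background reads.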
The main limitations of the APART pipeline are related to the mapping procedure. Some classes of ncRNAs, such as tRNAs, carry a number of post-transcriptional modifications, including modified nucleotides and non-encoded nucleotides, e.g. CCA at the 3′-ends. Nucleobase modifications in particular can lead to incorrect cDNA synthesis during reverse transcription (RT). These putative cDNA errors would subsequently lead to additional mismatches when the reads are aligned to the reference genome. To address this issue, we compared the number of reads aligning to the most intensely modified RNA species, rRNA and tRNA, when allowing one, two or three mismatches. It turned out that only a minor portion of the reads can be additionally aligned when the number of allowed mismatches is increased (
Supplementary Figure S4). Moreover, the observed difference is similar whether hyper-modified RNA species (rRNA and tRNA) or total reads from the library are considered. Thus, we decided to allow only one mismatch as the default parameter for APART. However, in higher eukaryotes, where the level of RNA modification is higher, the value should be adjusted accordingly. For libraries composed predominantly of tRNAs or other hyper-modified RNA species, we suggest using other alignment tools, such as segemehl (
45), which also allow for insertions and deletions. This feature allows more efficient handling not only of modified nucleotides, but also of the non-encoded CCA tails of tRNAs. In such cases, work with APART would start from a read alignment file in SAM format.
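A comparison of this kind can be sketched directly on a SAM file. The Python example below counts how many distinct reads have an alignment within one, two or three differences. It assumes single-end reads and an aligner that writes the standard `NM:i:` tag, which the SAM specification defines as edit distance and which therefore also counts insertions and deletions. The function name, the helper `toy_record` and the records themselves are hypothetical.

```python
def reads_within_mismatches(sam_lines, thresholds=(1, 2, 3)):
    """Count distinct mapped reads whose best NM value is <= each threshold."""
    best_nm = {}
    for line in sam_lines:
        if line.startswith("@"):      # header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:          # not a complete alignment record
            continue
        if int(fields[1]) & 0x4:      # FLAG bit 0x4: read is unmapped
            continue
        nm = None
        for tag in fields[11:]:       # optional fields follow the 11 mandatory ones
            if tag.startswith("NM:i:"):
                nm = int(tag[5:])
                break
        if nm is None:                # aligner did not report NM
            continue
        name = fields[0]
        best_nm[name] = min(nm, best_nm.get(name, nm))
    return {k: sum(1 for v in best_nm.values() if v <= k) for k in thresholds}

def toy_record(name, flag, nm=None):
    """Build a minimal, made-up SAM record for the example."""
    fields = [name, str(flag), "chrI", "100", "255", "20M", "*", "0", "0", "*", "*"]
    if nm is not None:
        fields.append(f"NM:i:{nm}")
    return "\t".join(fields)

sam_lines = [
    "@HD\tVN:1.6",
    toy_record("r1", 0, 0),
    toy_record("r2", 0, 1),
    toy_record("r3", 0, 2),
    toy_record("r3", 256, 3),  # secondary alignment of the same read
    toy_record("r4", 16, 3),
    toy_record("r5", 4),       # unmapped
]
print(reads_within_mismatches(sam_lines))  # {1: 2, 2: 3, 3: 4}
```

Applying the same count separately to reads overlapping rRNA and tRNA loci and to all reads would give the two series being compared. The restriction to annotated loci is left out here because it depends on the annotation format.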
Detection of stable RNAs can also be hampered by the experimental procedure used to generate the cDNA library. The main step causing potential bias is the amplification of the cDNA. During this procedure, a preference for short molecules is observed (
46). Unequal amplification can lead to multiplication of single cDNA molecules, resulting in an artificially sharp increase in the coverage of some genomic regions. Such cases can lead to false predictions of processing events. However, our experimental data suggest that such events are very rare, since the presence of all of the tested processing products has been experimentally confirmed.
One also has to keep in mind that not every sharp shift in read coverage is related to an RNA processing event. It could also be caused by an RNA structure-dependent drop-off of the reverse transcriptase (
47) or by preferential amplification of some of the short cDNA sequences during library generation. However, such cases cannot be distinguished based solely on the analysis of cDNA sequences.
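To make the notion of a sharp shift in read coverage concrete, the following Python sketch computes per-base coverage from read intervals and flags positions where coverage changes abruptly. It is an illustration only and not part of APART. The thresholds `min_fold` and `min_depth` and the toy intervals are assumptions. As stated above, a flagged position alone cannot tell a processing event from RT drop-off or amplification bias.

```python
def coverage(reads, length):
    """Per-base coverage from 0-based, end-exclusive (start, end) intervals."""
    diff = [0] * (length + 1)
    for start, end in reads:
        diff[start] += 1   # coverage rises where a read begins
        diff[end] -= 1     # and falls just past its last base
    cov, running = [], 0
    for d in diff[:length]:
        running += d
        cov.append(running)
    return cov

def sharp_shifts(cov, min_fold=5.0, min_depth=10):
    """Positions i where coverage differs at least min_fold between i-1 and i.

    Both thresholds are hypothetical values chosen for this example.
    """
    shifts = []
    for i in range(1, len(cov)):
        low, high = sorted((cov[i - 1], cov[i]))
        if high >= min_depth and high >= min_fold * max(low, 1):
            shifts.append(i)
    return shifts

# Two long background reads plus twenty identical short reads.
reads = [(0, 50)] * 2 + [(20, 40)] * 20
cov = coverage(reads, 50)
print(cov[19], cov[20], cov[39], cov[40])  # 2 22 22 2
print(sharp_shifts(cov))                   # [20, 40]
```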
To estimate the performance of the APART pipeline, we applied it to a previously published dataset, namely the small RNA library generated by Kawaji
et al. (
4), in which numerous types of RNA processing products were observed. Running in fully automated mode, APART detected and annotated the processing products in a similar way (Table 1). The only notable exceptions were miRNAs. In Kawaji
et al., 821 contigs corresponding to miRNAs were identified, whereas the default APART analysis resulted in the annotation of only 230 miRNAs. Such a large difference could arise from the different analysis approaches. The authors of the original work used a hierarchical mapping of the reads to different ncRNA classes, with a threshold of 80% sequence identity. In such an approach, reads are aligned to the different classes of transcripts not simultaneously, but in a specified order. Reads mapped to the first category are not considered for downstream categories. As a result, reads that could map to downstream transcript types with higher identity can be assigned to a false category, leading to overestimation of the categories placed at the beginning of the list. In contrast, APART maps reads to the reference genome and always picks the best-aligning genomic loci, resulting in a less biased analysis. Additionally, the clustering of redundant contigs implemented in APART, which merges contigs derived from multiple alignments of the same sets of reads, also reduces the final number of reported contigs.
Table 1. Assessment of the APART performance in comparison to data published by Kawaji et al. (4)
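The ordering effect of hierarchical mapping discussed above can be shown with a small Python sketch. It reduces both strategies to a choice among precomputed per-category identities, which is a simplification: APART maps to the reference genome rather than to category databases. The category order, the identity values and the function names are hypothetical. Only the 80% threshold is taken from the text.

```python
def hierarchical_assign(hits, order, min_identity=0.80):
    """Assign a read to the first category in `order` that passes the threshold."""
    for category in order:
        if hits.get(category, 0.0) >= min_identity:
            return category
    return None

def best_hit_assign(hits, min_identity=0.80):
    """Assign a read to the category with the highest identity overall."""
    if not hits:
        return None
    category, identity = max(hits.items(), key=lambda item: item[1])
    return category if identity >= min_identity else None

# One read that matches a miRNA at 82% identity and a tRNA perfectly.
hits = {"miRNA": 0.82, "tRNA": 1.00}
order = ["miRNA", "tRNA"]  # hypothetical order, miRNAs searched first

print(hierarchical_assign(hits, order))  # miRNA
print(best_hit_assign(hits))             # tRNA
```

With miRNAs placed early in the list, the read is counted as a miRNA even though it matches a tRNA perfectly, which is the kind of overestimation described in the text.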
The high number of novel processing products and novel intergenic ncRNAs suggested by the analysis of the cDNA library constructed from ribosome-associated small RNAs reveals the potential of the presented methodology. Owing to the features of APART, we were able to predict and experimentally verify the differential and stress-dependent processing of tRNAs, rRNAs and snoRNAs. Furthermore, our data suggest that these ncRNA processing products are associated with yeast ribosomes under different environmental growth conditions. Besides the presented yeast cDNA library, APART has also been successfully applied to archaeal, mouse and human ncRNA libraries (M.Z., N.P., unpublished data), as well as to libraries generated by genomic SELEX or CLIP (M.Z., Renee Schroeder, Andrea Barta, unpublished data) approaches employing complex eukaryal model organisms (M.Z., Alexander Hüttenhofer, unpublished data). This emphasizes the general potential of APART for efficient de novo assembly and annotation of short read libraries.