The Oases assembly process, explained in detail below and illustrated in , consists of independent assemblies, which vary by one important parameter, the hash (or k-mer) length. In each of the assemblies, the reads are used to build a de Bruijn graph, which is then simplified for errors, organized into a scaffold, divided into loci and finally analyzed to extract transcript assemblies or transfrags. Once all of the individual k-mer assemblies are finished, they are merged into a final assembly.
Fig. 1. Schematic overview of the Oases pipeline: (1) Individual reads are sequenced from an RNA sample; (2) Contigs are built from those reads, some of them are labeled as long (clear), others short (dark); (3) Long contigs, connected by single reads or read-pairs (more ...)
2.2 Contig assembly
The Oases pipeline receives as input a preliminary assembly produced by the Velvet assembler (Zerbino and Birney, 2008
) which was designed to produce scaffolds from genomic readsets. Its initial stages, namely hashing and graph construction can be used indifferently on transcriptome data. We only run these stages of Velvet to produce a preliminary fragmented assembly, containing the mapping of the reads onto a set of contigs.
However, the later stage algorithms, Pebble and Rock Band, which resolve repeats in Velvet, are not used because they rely on assumptions related to genomic sequencing (Zerbino et al., 2009
). Namely, the coverage distribution should be roughly uniform across the genome and the genome should not contain any branching point. These conditions prevent those algorithms from being reliable and efficient on RNA-seq data.
2.3 Contig correction
After reading the contigs produced by Velvet, Oases proceeds to correct them again with a set of dynamic and static filters.
The first dynamic correction is a slightly modified version of Velvet's error correction algorithm, TourBus. TourBus searches through the graph for parallel paths that have the same starting and end node. If their sequences are similar enough, the path with lower coverage is merged into the path with higher coverage, irrespective of their absolute coverage. In this sense, the TourBus algorithm is adapted to RNA-seq data and fluctuating coverage depths. However, for performance issues, the Velvet version of TourBus only visits each node once, meaning that it does not exhaustively compare all possible pairs of paths. Given the high coverage of certain genes, and the complexity of the corresponding graphs, with numerous false positive paths, it is necessary for Oases to exhaustively examine the graph, visiting nodes several times if necessary.
In addition to this correction, Oases includes a local edge removal
. For each node, an outgoing edge is removed if its coverage represents <10% of the sum of coverages of outgoing edges from that same node. This approach, similar to the one presented by Yassour et al. (2011
), is based on the assumption that on high coverage regions, spurious errors are likely to reoccur more often.
Finally, all contigs with less than a static coverage cutoff (by default 3×) are removed from the assembly. The rationale for this filter is that any transcript with such a low coverage cannot be properly assembled in the first place, so it is expedient to remove them from the assembly, along with many low coverage contigs created by spurious errors.
2.4 Scaffold construction
The distance information between the contigs is then summarized into a set of distance estimates called a scaffold
, as described in (Zerbino et al., 2009
). Because a read in a de Bruijn graph can be split between several contigs, the distance estimate for a connection
between two contigs can be supported by both spanning single reads or paired-end reads.
The total number of spanning reads and pair-end reads confirming a connection is called its support. A connection which is supported by at least one spanning read is called direct, otherwise, it is indirect.
Connections are assigned a total weight. It is calculated by adding 1 for each supporting spanning read and a probabilistic weight for each spanning pair, proportional to the likelihood of observing the paired reads at their observed positions on the contigs given the estimated distance between the contigs and assuming a normal insert length distribution model.
2.5 Scaffold filtering
Much like the contig correction phase, several filters are applied to the scaffold: static coverage thresholds for the very low coverage sequences and a dynamic coverage threshold that adapts to the local coverage depth.
Because coverage is no longer indicative of the uniqueness of a sequence, contig length is used as an indicator. Based on the decreasing likelihood of high identity conservation as a function of sequence length (Whiteford et al., 2005
), contigs longer than a given threshold [by default (50+k−1) bp] are labeled as long
and treated as if unique and the other nodes are labeled as short
Connections with a low support (by default 3× or lower) or with a weight <0.1 are first removed. Two short contigs can only be joined by a direct connection with no intermediate gap. A short and a long contig can only be connected by a direct connection.
Finally, connections between long contigs are tested against a modified version of the statistic presented in (Zerbino et al., 2009
), which estimates how many read pairs should connect two contigs given their respective coverages and the estimated distance separating them (see Supplementary Material
). Indirect connections with a support lower than a given threshold (by default 10% of this expected count) are thus eliminated.
2.6 Locus construction
Oases then organizes the contigs into clusters called loci, as illustrated in . This terminology stems from the fact that in the ideal case, where no gap in coverage or overlap with exterior sequences complicate matters, all the transcripts from one gene should be assembled into a connected component of contigs. Unfortunately, in experimental conditions, this equivalence between components and genes cannot be guaranteed. It is to be expected that loci sometimes represent fragments of genes or clusters of homologous sequences.
Scaffold construction takes place in two stages similarly to the approach described by Butler et al. (2008
). Long contigs are first clustered into connected components. These long nodes have a higher likelihood of being unique, therefore it is assumed that two contigs which belong to the same component also belong to the same gene. To each locus are added the short nodes which are connected to one of the long nodes in the cluster.
2.7 Transitive reduction of the loci
For the following analyses to function properly, it is necessary to remove redundant long distance connections, and retain only connections between immediate neighbors, as seen in . For example, it is common that two contigs which are not consecutive in a locus are connected by a paired-end read.
A connection is considered redundant if it connects two nodes that are connected by a distinct path of connections such that the connection and the two paths have comparable lengths. The transitive reduction implemented in Oases is inspired from the one described in (Myers, 2005
) but had to be adapted to the conditions of short read data. In particular, short contigs can be repeated or even inverted within a single transcript and form loops in the connection graph. Because of this, occasional situations arise where every connection coming out of a node can be transitively reduced by another one, thus removing all of them, and breaking the connectivity of the locus. To avoid this, a limit is imposed on the number of removed connections. If two connections have the capacity to reduce each other, the shortest one is preserved.
2.8 Extracting transcript assemblies
The sequence information of the transcripts is now contained in the loci. These loci can be fragmented because of alternative splicing events which cause the de Bruijn graph to have a branch. Oases, therefore, analyses the topology of the loci to extract full length isoform assemblies.
In many cases, the loci present a simple topology which can be trivially and uniquely decomposed as one or two transcripts. We define three categories of trivial locus topologies (): chains, forks and bubbles, which if isolated from any other branching point, are straightforward to resolve. These three topologies are easily identifiable using the degrees of the nodes. Oases, therefore, detects all the trivial loci and enumerates the possible transcripts for each of them.
Because the above exact method only applies to specific cases, an additional robust heuristic method is applied to the remaining loci, referred to as complex
loci. Oases uses a reimplementation of the algorithm described in (Lee, 2003
), which efficiently produces a parsimonious set of putative highly expressed transcripts, assuming independence of the alternative splicing events.
This extension of the algorithm is quite intuitive, since there is a direct analogy between the de Bruijn graph built from the transcripts of a gene and its splicing graph, as noted by Heber et al. (2002
). Using dynamic programming, it enumerates heavily weighted paths through the locus graph in decreasing order of coverage, until either all the contigs of the locus are covered, or a specified number of transcripts is produced (by default 10).
As in the transitive reduction phase, this algorithm had to be slightly modified to allow for loops in the putative splicing graph of the locus. Loops are problematic because their presence can prevent the propagation of the dynamic programming algorithm to all the contigs of a locus. When a loop is detected, it is broken at a contig which connects the loop to the rest of the locus, so as to leave a minimum number of branch points, as described in the Supplementary Material
2.9 Merging assemblies with Oases-M
De Bruijn graph assemblers are very sensitive to the setting of the hash length k. For transcriptome data, this optimization is more complex as transcript expression levels and coverage depths are distributed over a wide range. A way to avoid the dependence on the parameter k is to produce a merged transcriptome assembly of previously generated transfrags from Oases.
Oases is run for a set of [kMIN,…,kMAX] values and the output transfrags are stored. All predicted transfrags from runs in the interval are then fed into the second stage of the pipeline, Oases-M, with a user selected kMERGE. A de Bruijn graph for kMERGE is built from these transfrags. After removing small variants with the Tourbus algorithm, any transfrag in the graph that is identical or included in another transfrag is removed. The final assembly is constructed by following the remaining transfrags through the merged graph.