Our algorithms operate on a contig graph. A contig may represent a single unitig or an ungapped concatenation of multiple contigs. For each mate-pair connecting pairs of contigs, we generate a link
l with length
d(
l) and orientation computed from the orientation and positions of the reads in the contigs. The SD σ(
l) is provided as input to Bambus 2. Using the set of links between pairs of contigs, the orientation is set as the orientation of the majority of the links. Once an orientation is selected, we check whether the distance constraints implied by the links are consistent with each other. If not, we discard the smallest number of links that results in a consistent set
S (the largest consistent set can be found in
nlogn time using an algorithm for maximal clique finding in an interval graph). Each consistent set is output as an edge
e with weight
w(
e)=|
S|. The average length

and SD

as suggested in
Huson et al. (2001). Additional information, such as overlaps between adjacent contigs (contigs sharing common sequence), is also included when constructing the edges. The resulting graph is bidirected (
Medvedev et al., 2007).
Scaffolding consists of three operations: orientation, positioning and simplification. Throughout the process, we prune the graph by removing contradictory edges and recording their reason for removal.
To avoid the ambiguity introduced by repeats, we start with a repeat detection step, then exclude all repeat contigs and incident edges from scaffolding. The (possibly multiple) placement of these nodes can be determined after the initial scaffolding is complete.
Centrality-based repeat detection: we calculate the all-pairs-shortest paths with each edge having weight
w=1. For each node,
v, we calculate the number of times it appears on a shortest path:
Pv. Note that larger contigs are expected to have a higher degree because they contain more reads and, therefore, have a higher chance of being the end-point of a mate-pair link. To correct for this, we linearly scale
Pv by the contig length. Such a length-dependent correction has been previously proposed in the context of estimation of gene abundance in metagenomic samples (
Sharon et al., 2009). A node is declared repetitive if the scaled

where
c is a constant (usually set to 3),

is the mean of all scaled
Pv ∀
v
V and σ is the SD of all scaled
Pv ∀
v
V.
Local coverage statistic: for each connected component
S and for each node
v
S, we compute the A-stat value (
Myers et al., 2000). An abundant organism is less likely to appear repetitive in our approach as the connected component is more homogeneous. This operation is carried out after the repeat nodes identified by all-pairs-shortest paths have been removed.
Orientation: we must first convert the bidirected graph into a directed graph by choosing an orientation for each node in the graph. We call reverse edges any pairwise constraints that require the adjacent contigs to be in opposite orientations. It is impossible to assign a consistent order to nodes involved in a cycle with an odd number of reverse edges without discarding edges. We attempt to remove a minimum number of edges to allow a consistent orientation to be assigned. Finding such a minimum set is equivalent to the Maximal Bipartite Subgraph problem which is NP-hard (
Garey and Johnson, 1979). We rely on a greedy heuristic proposed by
Kececioglu and Myers (1995) that achieves a two-factor approximation. The algorithm runs in
O(
V+
E) time.
Positioning: in addition to assigning an edge direction, we want to assign a position for each contig. There may be multiple edges assigning contradictory positions to a node. These imperfect data are the result of experimental errors and repeats (ambiguities in the placement of reads along a genome). We want to maximize the number of satisfied edges by placing nodes as close to the specified position as possible. This problem is similar to the Optimal Linear Arrangement problem which is also NP-hard (
Garey and Johnson, 1979). We rely on the following greedy extension heuristic to linearly order the contigs: scaffolding starts by placing an arbitrary node at position 0. For each node without a position, compute an initial position based on all already-placed neighbors as a weighted average. Subsequent edges can reposition the node within a limit of 3σ(
e) where σ(
e) is the SD of the edge. The extension stops when the ratio of an edge weight
w(
e(
u,
v)) to the maximum weight edge incident on node
u or
v is below a threshold. Edges eliminated from the graph due to invalid orientation are not used in this step. The algorithm runs in
O(
V+
E) time. This heuristic is sufficient once the graph is simplified as above and repeat contigs removed.
Simplification: a transitive reduction is applied to the contig graph and redundant edges are removed. Transitive edges [an edge
e(
u,
v) such that there is a path
p with a set of edges
pe
E incident on nodes
pv
V between
u and
v not including
e(
u,
v)] are removed from acyclical components of the graph by performing a depth-first search from each node in topological order. Given the sequence lengths of contig in the graph
l(
v) ∀
v
V and a path
p, we define the length of the path as
l(
p)=∑
∀ contigs
v
pv l(
v)+∑
∀ edges
e
pe l(
e). Define the SD of the path as σ(
p)=∑
∀ edges
e
pe σ(
e). A transitive edge is removed when |
l(
e)−
l(
p)|≤σ(
e)+σ(
p). These edges can be removed without loss of information. Simple paths (all nodes have in- and out-degree equal to 1) are then collapsed: the nodes on the path are replaced with a single node representing the concatenation of the original nodes, and the intervening edges are removed from the graph. Finally, each simplified connected component in the graph gets reported as a scaffold.
Variant detection: once we have oriented and positioned the contigs and simplified the graph, we iteratively search for variation motifs. We search for subgraphs where multiple paths begin at a source node and collapse to one sink node within a certain number of hops. To allow for artifacts due to incomplete coverage, we allow subgraphs where paths terminate before reaching the sink.
Given graph
G=(
V,
E) and motif set
S
V
That is, the incoming edges may only be incident on the source node and the outgoing edges may only be incident on the sink node. Finally, to avoid false positives due to layouts that satisfy edge constraints but where nodes can be placed in a linear, non-overlapping order, we calculate the overlap ratio.
Given
S
V, node
v
S, start coordinate of
v,
B(
v) and end coordinate of
v,
E(
v)
The overlap ratio is then

. Intuitively, it is the total number of bases covered by two or more nodes, divided by the total number of bases in the motif. Motifs whose overlap ratio exceeds a threshold are marked as a polymorphism. To make the problem tractable, only subgraphs with a diameter of 2 are detected in the current implementation of our algorithm. Each iteration of motif detection has a runtime of
O(|
V|×(Δ(
G)
3+3Δ(
G))) where Δ(
G) is the maximum degree of
G. This algorithm has a worst-case runtime of
O(|
V|×(|
E|
3+3|
E|)). However, in a contig graph it is likely that Δ(
G)<<|
E|. Every level of depth multiplies the runtime by a factor of Δ(
G).
Output: Bambus 2 supports several output formats. Since we do not linearize scaffolds and maintain ambiguity due to variation in the graph, the native output is a graph [in Graphviz format
Gansner and North (2000)]. Bambus 2 also finds the longest sequence reconstruction through each scaffold. That is, it will ignore variant motifs and generate a single self-consistent sequence for each scaffold. Additionally, Bambus 2 outputs each variation motif as a set of sequences. For each motif,
S, we start from the
source node, as defined above. For each child node
c of
source, we recursively compute the sequences starting at
c. The longest sequence starting at
source is the master sequence of the motif. The alternate sequences found in the graph are also output, including edit positions specifying where within the master sequence they belong. b shows an example alignment of the fasta output for a variant region within
E.coli.
Test data: we tested the algorithm using nine datasets.
Brucella suis 1330 comprised 36 080 reads and available as NCBI Trace Archive Project ID 320. The reference includes: AE014291:AE014292 (2 107 792 bp, 1 207 381 bp). Three simulated datasets were generated using MetaSim (
Richter et al., 2008) (). The acid mine drainage dataset, generated by
Simmons et al. (2008);
Tyson et al. (2004), consists of 179 770 reads and is available as NCBI Trace Archive Project ID 13696. The reference AMD dataset includes:
Ferroplasma acidarmanus Type I,
Ferroplasma sp. Type II,
Leptospirillum sp. Group II 5-way CG,
Leptospirillum sp. Group III and
Thermoplasmatales archaeon Gpl and is available as CH003520:CH004435. The Twin Gut data were generated by
Turnbaugh et al. (2008) and is available as
SRA002775 (8.30M GS FLX fragments). The MetaHit datasets were generated by the MetaHit consortium (
Qin et al., 2010) and are available as ERS006526, ERS006594 and ERS006494.