


BMC Bioinformatics. 2010; 11: 560.

Published online 2010 November 15. doi: 10.1186/1471-2105-11-560

PMCID: PMC2996408

Vamsi K Kundeti,^{1} Sanguthevar Rajasekaran,^{1} Hieu Dinh,^{1} Matthew Vaughn,^{2} and Vishal Thapar^{3}

Vamsi K Kundeti: vamsik@engr.uconn.edu; Sanguthevar Rajasekaran: rajasek@engr.uconn.edu; Hieu Dinh: hieu@engr.uconn.edu; Matthew Vaughn: matt.vaughn@gmail.com; Vishal Thapar: thapar@cshl.edu

Received 2010 June 1; Accepted 2010 November 15.

Copyright ©2010 Kundeti et al; licensee BioMed Central Ltd.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


Assembling genomic sequences from a set of overlapping reads is one of the most fundamental problems in computational biology. Algorithms addressing the assembly problem fall into two broad categories based on the data structures they employ: the first class uses an overlap/string graph and the second uses a de Bruijn graph. With the recent advances in short read sequencing technology, de Bruijn graph based algorithms have come to play a vital role in practice, and efficient algorithms for building these massive de Bruijn graphs are essential in large sequencing projects based on short reads. In an earlier work, an *O*(*n/p*) time parallel algorithm was given for this problem, where *n* is the size of the input and *p* is the number of processors. This algorithm enumerates all possible bi-directed edges which can overlap with a node and ends up generating Θ(*n*Σ) messages (Σ being the size of the alphabet).

In this paper we present a Θ(*n/p*) time parallel algorithm with a communication complexity equal to that of parallel sorting and not sensitive to Σ. The generality of our algorithm makes it easy to extend even to the out-of-core model, where it has an optimal I/O complexity of $\Theta (\frac{n\mathrm{log}(n/B)}{B\mathrm{log}(M/B)})$ (*M* being the main memory size and *B* the size of a disk block). We demonstrate the scalability of our parallel algorithm on an SGI/Altix computer. A comparison of our algorithm with the previous approaches reveals that our algorithm is faster, both asymptotically and in practice. We demonstrate the scalability of our sequential out-of-core algorithm by comparing it with the algorithm used by VELVET to build the bi-directed de Bruijn graph. Our experiments reveal that our algorithm can build the graph with a constant amount of memory, clearly outperforming VELVET. We also provide efficient algorithms for the bi-directed chain compaction problem.

The bi-directed de Bruijn graph is a fundamental data structure for any sequence assembly program based on the Eulerian approach. Our algorithms for constructing bi-directed de Bruijn graphs are efficient in parallel and out-of-core settings, and can be used in building large scale bi-directed de Bruijn graphs. Furthermore, our algorithms do not employ any all-to-all communications in a parallel setting and perform better than the prior algorithms. Finally, our out-of-core algorithm is extremely memory efficient and can replace the existing graph construction algorithm in VELVET.

Sequencing genomes is one of the most fundamental problems in modern biology and has immense impact on biomedical research. *De novo sequencing* is computationally more challenging than sequencing with a reference genome. Moreover, the existing sequencing technology is not mature enough to identify/read the entire sequence of a genome, especially for complex organisms like mammals. However, small fragments of the genome can be read with acceptable accuracy. The *shotgun* sequencing employed in many sequencing projects breaks the genome randomly at many places and generates a large number of small fragments (*reads*) of the genome. The problem of reassembling all the fragmented reads into a sequence close to the original is known as the *Sequence Assembly* (SA) problem.

Although the SA problem seems similar to the *Shortest Common Superstring* (SCS) problem, there are in fact some fundamental differences. Firstly, the genome sequence might contain several repeating regions, but an optimal solution to the SCS problem will not contain repeating regions, because it minimizes the length of the solution string. In addition to the repeats, other issues such as errors in reads and the double strandedness of the reads make a reduction to the SCS problem very complex.

Existing algorithms to address the SA problem can be classified into two broad categories. The first class of algorithms models each read as a vertex in a directed graph, known as the *overlap graph* [1]. The second class models every substring of length *k* (i.e., a *k*-mer) in a read as a vertex in a (subgraph of a) *de Bruijn* graph [2].

In an overlap graph, for every pair of overlapping reads, directed edges are introduced consistent with the orientation of the overlap. Since the transitive edges in the overlap graph are redundant for the assembly process, they are removed; the resultant graph is called the *string graph* [1]. The edges of the string graph are classified into *optional*, *required* and *exact*. The SA problem is reduced to the identification of a shortest walk in the string graph which satisfies all the required and exact constraints on the edges. Identifying such a walk (a *minimum S-walk*) on the string graph is known to be NP-hard [3].

In the de Bruijn graph model every substring of length *k *in a read acts as a vertex [2]. A directed edge is introduced between two *k*-mers if there exists some read in which these two *k*-mers overlap by exactly *k *- 1 symbols. Thus every read in the input is mapped to some path in the de Bruijn graph. The SA problem is reduced to a *Chinese Postman Problem *(CPP) on the de Bruijn graph, subject to the constraint that the resultant CPP tour include all the paths corresponding to the reads. This problem (*Shortest Super Walk*) is also known to be NP-hard. Thus solving the SA problem exactly on both of these graph models is intractable.
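As a concrete illustration of the de Bruijn graph model (our own sketch, not the paper's implementation), the following C code enumerates the path edges induced by a single read and verifies the (*k* - 1)-symbol overlap between consecutive *k*-mers; the function names are illustrative.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Write the i-th k-mer of `read` into `out` (out must hold k+1 bytes). */
static void kmer_at(const char *read, int i, int k, char *out)
{
    memcpy(out, read + i, (size_t)k);
    out[k] = '\0';
}

/* Count the de Bruijn edges (consecutive k-mer pairs) spelled by one
 * read, checking the (k-1)-symbol overlap along the way. Returns -1
 * if an overlap check fails (it never should for a well-formed read);
 * otherwise a read of length r yields exactly r-k edges, so the read
 * maps to a path in the graph. */
static int read_to_path_edges(const char *read, int k)
{
    int n = (int)strlen(read);
    int edges = 0;
    char a[64], b[64];
    for (int i = 0; i + k < n; i++) {
        kmer_at(read, i, k, a);
        kmer_at(read, i + 1, k, b);
        /* suffix of a of length k-1 must equal prefix of b */
        if (memcmp(a + 1, b, (size_t)(k - 1)) != 0)
            return -1;
        edges++;
    }
    return edges;
}
```

For example, the read *ATGGACCAT* with *k* = 3 spells a path of six edges through the 3-mers of the graph.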

Overlap graph based algorithms were found to perform better (see [4-7]) with Sanger read methods. Sanger methods produce reads typically around 1000 base pairs long and are very reliable, but they are also very expensive. To overcome these issues, new read technologies such as sequencing by synthesis (Solexa) and pyrosequencing (454 sequencing) have emerged. These rapidly emerging read technologies are collectively termed *Next Generation Sequencing* (NGS) technologies. NGS technologies produce shorter genome fragments (anywhere from 25 to 500 base pairs long) and have drastically reduced the sequencing cost per base pair compared to Sanger technology. The reliability of NGS technology is acceptable, although relatively lower than that of Sanger technology. In the recent past, the sequencing community has witnessed an exponential growth in the adoption of these NGS technologies.

On the other hand, these NGS technologies can increase the number of reads in the SA problem by orders of magnitude. The computational cost of building an overlap graph on these short reads is much higher than that of building a de Bruijn graph, and de Bruijn graph based algorithms handle short reads very efficiently in practice (see [8]). However, the major bottleneck for de Bruijn graph based assemblers is the large amount of memory they require to build the graph.

This limits the applicability of these methods to large scale SA problems. In this paper we address this issue and present algorithms to construct large de Bruijn graphs very efficiently. Our algorithm is optimal in the sequential, parallel, and out-of-core models. A recent work by Jackson and Aluru [9] yielded parallel algorithms to build these de Bruijn graphs efficiently. They present a parallel algorithm that runs in *O*(*n/p*) time using *p *processors (assuming that *n *is a constant-degree polynomial in *p*). The *message complexity *of their algorithm is Θ(*n*Σ). By message complexity we mean the total number of messages (i.e., *k*-mers) communicated by all the processors in the entire algorithm. The distributed de Bruijn graph building algorithm in ABySS [10] is similar to the algorithm of [9].

One of the major contributions of our work is to show that we can build a bi-directed de Bruijn graph in Θ(*n/p*) time with a message complexity of Θ(*n*). An experimental comparison of these two algorithms on an SGI Altix machine shows that our algorithm is considerably faster. In addition, our algorithm works optimally in an out-of-core setting. In particular, our algorithm requires only $\Theta (\frac{n\mathrm{log}(n/B)}{B\mathrm{log}(M/B)})$ I/O operations.

Let *s* ∈ Σ*^{n}* be a string of length *n*.

A *bi-directed* graph is a generalization of a standard directed graph. In a directed graph every edge has a single arrow head (→ or ←). In a bi-directed graph every edge has two arrow heads attached to it, one at each end, and each arrow head is oriented independently, giving four possible edge types. Let *V* be the set of vertices and *E* = {(*v_{i}*, *v_{j}*, *o*_{1}, *o*_{2})} the set of bi-directed edges, where *o*_{1} and *o*_{2} denote the orientations of the arrow heads at *v_{i}* and *v_{j}*, respectively.

A de Bruijn graph *D^{k}*(*S*) of order *k* on a set of strings *S* is defined as follows. The set of vertices *V* consists of all the *k*-mers present in the strings of *S*, and the set of directed edges is

*E* = {(*v_{i}*, *v_{j}*) | suf(*v_{i}*, *k* - 1) = pre(*v_{j}*, *k* - 1)},

i.e., a directed edge is introduced from *v_{i}* to *v_{j}* whenever the suffix of *v_{i}* of length *k* - 1 equals the prefix of *v_{j}* of the same length. Such a graph does not account for the fact that DNA is double stranded, so a read may come from either strand.

To address this, a bi-directed de Bruijn graph $B{D}^{k}(S\cup \overline{S})$ has been suggested in [3]. The set of vertices *V* of $B{D}^{k}(S\cup \overline{S})$ consists of all possible *k*-molecules from $S\cup \overline{S}$. The set of bi-directed edges of $B{D}^{k}(S\cup \overline{S})$ is defined as follows. Let *x*, *y* be two *k*-mers which are next to each other in some input string $z\in (S\cup \overline{S})$. Then an edge is introduced between the *k*-molecules *v_{i}* and *v_{j}* corresponding to *x* and *y*, with the orientations of its arrow heads determined by the forms (canonical or reverse complement) in which *x* and *y* occur in *z*.
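The notion of a *k*-molecule can be made concrete with a small sketch: a *k*-mer and its reverse complement form one molecule, represented by the lexicographically smaller of the two (the canonical *k*-mer). The function names below are our own illustrative choices, not the paper's code; since *k* is odd, a *k*-mer never equals its own reverse complement.

```c
#include <assert.h>
#include <string.h>

/* DNA complement of a single base. */
static char comp(char c)
{
    switch (c) {
    case 'A': return 'T';
    case 'T': return 'A';
    case 'G': return 'C';
    case 'C': return 'G';
    }
    return c;
}

/* Writes the reverse complement of s[0..k) into out (k+1 bytes). */
static void rev_comp(const char *s, int k, char *out)
{
    for (int i = 0; i < k; i++)
        out[i] = comp(s[k - 1 - i]);
    out[k] = '\0';
}

/* Writes the canonical k-mer: the lexicographic minimum of s and its
 * reverse complement, i.e. the representative of the k-molecule. */
static void canonical(const char *s, int k, char *out)
{
    char rc[64];
    rev_comp(s, k, rc);
    if (strncmp(s, rc, (size_t)k) <= 0) {
        memcpy(out, s, (size_t)k);
        out[k] = '\0';
    } else {
        memcpy(out, rc, (size_t)(k + 1));
    }
}
```

For instance, *ATG* and *CAT* form a single 3-molecule whose canonical 3-mer is *ATG*.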

Figure 2 illustrates a simple example of the bi-directed de Bruijn graph of order *k* = 3 built from the reads *ATGG, CCAT, GGAC, GTTC, TGGA* and *TGGT* observed from a DNA sequence *ATGGACCAT* and its reverse complement *ATGGTCCAT*. Consider two 3-molecules *v*_{1} = (

In this section we describe our algorithm BiConstruct to construct a bi-directed de Bruijn graph on a given set of reads. The following are the main steps of the algorithm. Let *R_{f}* = {*r*_{1}, . . . , *r_{N}*} be the given set of reads.

• [STEP-1] **Generate canonical edges: **Let (*x, y*) = (*z*[1 . . . *k*], *z*[2 . . . *k* + 1]) be the *k*-mers corresponding to a (*k* + 1)-mer *z*[1 . . . *k* + 1] of some read. Recall that $\widehat{x}$ and *ŷ* are the canonical *k*-mers of *x* and *y*, respectively. Create a canonical bi-directed edge $({\widehat{v}}_{i},{\widehat{v}}_{j},{o}_{1},{o}_{2})$ for each (*k* + 1)-mer as follows, where the orientation + at a vertex indicates that the corresponding *k*-mer occurs in its canonical form and - that it occurs as a reverse complement.

$$({\widehat{v}}_{i},{\widehat{v}}_{j},{o}_{1},{o}_{2})=\begin{cases}(\widehat{x},\widehat{y},+,+)\ \text{if}\ \widehat{x}\le \widehat{y},\ \text{else}\ (\widehat{y},\widehat{x},-,-) & x=\widehat{x}\wedge y=\widehat{y}\\ (\widehat{x},\widehat{y},-,+)\ \text{if}\ \widehat{x}\le \widehat{y},\ \text{else}\ (\widehat{y},\widehat{x},-,+) & x\ne \widehat{x}\wedge y=\widehat{y}\\ (\widehat{x},\widehat{y},+,-)\ \text{if}\ \widehat{x}\le \widehat{y},\ \text{else}\ (\widehat{y},\widehat{x},+,-) & x=\widehat{x}\wedge y\ne \widehat{y}\\ (\widehat{x},\widehat{y},-,-)\ \text{if}\ \widehat{x}\le \widehat{y},\ \text{else}\ (\widehat{y},\widehat{x},+,+) & x\ne \widehat{x}\wedge y\ne \widehat{y}\end{cases}$$

• [STEP-2] **Reduce multiplicity: **Sort all the bi-directed edges generated in [STEP-1] using radix sort. Since the parameter *k* is always odd, no *k*-mer equals its own reverse complement, which guarantees that a pair of canonical *k*-mers has exactly one orientation. Remove the duplicates and record the multiplicity of each canonical edge. Gather all the unique canonical edges into an edge list $\mathcal{E}$.

• [STEP-3] **Collect bi-directed vertices: **For each canonical bi-directed edge $({\widehat{v}}_{i},{\widehat{v}}_{j},{o}_{1},{o}_{2})$, collect the canonical *k*-mers ${\widehat{v}}_{i}$, ${\widehat{v}}_{j}$ into list $\mathcal{V}$. Sort the list $\mathcal{V}$ and remove duplicates, such that $\mathcal{V}$ contains only the unique canonical *k*-mers.

• [STEP-4] **Adjacency list representation: **The list $\mathcal{E}$ is the collection of all the edges and the list $\mathcal{V}$ the collection of all the nodes of the bi-directed graph. It is now easy to use $\mathcal{E}$ and $\mathcal{V}$ to generate the adjacency list representation of the bi-directed graph. This may require one extra radix sorting step.
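The edge-generation and deduplication steps above can be sketched sequentially as follows. This is our own illustrative code, not the authors' implementation: `qsort` stands in for the radix sort of [STEP-2], and canonical forms are computed directly on plain strings.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define KMAX 64   /* illustrative bound on (k+1)-mer length */

static char cpl(char c)
{
    return c == 'A' ? 'T' : c == 'T' ? 'A' : c == 'G' ? 'C' : 'G';
}

/* Canonical form of s[0..len): min(s, reverse complement of s). */
static void canon(const char *s, int len, char *out)
{
    char rc[KMAX];
    for (int i = 0; i < len; i++)
        rc[i] = cpl(s[len - 1 - i]);
    rc[len] = '\0';
    if (strncmp(s, rc, (size_t)len) <= 0) {
        memcpy(out, s, (size_t)len);
        out[len] = '\0';
    } else {
        strcpy(out, rc);
    }
}

static int cmp(const void *a, const void *b)
{
    return strcmp((const char *)a, (const char *)b);
}

/* STEP-1/STEP-2 sketch: emit the canonical form of every (k+1)-mer of
 * every read into edges[] (caller provides enough slots), sort, and
 * scan once to drop duplicates while recording multiplicities.
 * Returns the number of distinct canonical edges. */
static int biconstruct_edges(const char **reads, int nreads, int k,
                             char edges[][KMAX], int *mult)
{
    int m = 0;
    for (int r = 0; r < nreads; r++) {
        int len = (int)strlen(reads[r]);
        for (int i = 0; i + k + 1 <= len; i++)
            canon(reads[r] + i, k + 1, edges[m++]);
    }
    qsort(edges, (size_t)m, KMAX, cmp);
    int u = 0;   /* unique edge count */
    for (int i = 0; i < m; i++) {
        if (i == 0 || strcmp(edges[i], edges[u - 1]) != 0) {
            if (u != i)
                memcpy(edges[u], edges[i], KMAX);
            mult[u++] = 1;
        } else {
            mult[u - 1]++;
        }
    }
    return u;
}
```

Note how the canonicalization merges the two strands: the reads *ATGG* and *CCAT* (reverse complements of each other) produce one canonical edge with multiplicity 2.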

**Theorem 1**. *Algorithm *BiConstruct *builds a bi-directed de Bruijn graph of order k in *Θ(*n*) *time. Here n is the number of characters/symbols in the input*.

*Proof*. Without loss of generality assume that all the reads are of the same size *r*. Let *N *be the number of reads in the input. This generates a total of (*r *- *k*)*N *(*k *+ 1)-mers in [STEP-1]. The radix sort needs to be applied in at most 2*k *log(|Σ|) passes, resulting in 2*k *log(|Σ|)(*r *- *k*)*N *operations. Since *n *= *Nr *is the total number of characters/symbols in the input, the radix sort takes Θ(*kn *log(|Σ|)) operations assuming that in each pass of sorting only a constant number of symbols are used. If *k *log(|Σ|) = *O*(log *N *), the sorting takes only *O*(*n*) time. In practice since *N *is very large in relation to *k *and |Σ|, the above condition readily holds. Since the time for this step dominates that of all the other steps, the runtime of the algorithm BiConstruct is Θ(*n*).
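The radix sort in the proof can be made concrete by packing a *k*-mer at 2 bits per symbol (|Σ| = 4) and sorting a fixed number of bits per pass, so each pass consumes a constant number of symbols. This is a hedged sketch under those assumptions, not the paper's implementation.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define RADIX_BITS 8      /* bits consumed per sorting pass */
#define MAX_ITEMS  1024   /* illustrative bound on the list size */

/* Pack a k-mer at 2 bits per base; codes compare like the k-mers,
 * since A < C < G < T in the encoding. */
static uint64_t pack_kmer(const char *s, int k)
{
    static const uint8_t code[256] = { ['A'] = 0, ['C'] = 1,
                                       ['G'] = 2, ['T'] = 3 };
    uint64_t v = 0;
    for (int i = 0; i < k; i++)
        v = (v << 2) | code[(uint8_t)s[i]];
    return v;
}

/* Stable LSD radix sort of n packed k-mers (2k significant bits),
 * ceil(2k / RADIX_BITS) counting passes in total. */
static void radix_sort_kmers(uint64_t *a, int n, int k)
{
    uint64_t tmp[MAX_ITEMS];
    const uint64_t mask = (1u << RADIX_BITS) - 1;
    int passes = (2 * k + RADIX_BITS - 1) / RADIX_BITS;
    for (int p = 0; p < passes; p++) {
        int shift = p * RADIX_BITS;
        int count[1 << RADIX_BITS] = {0};
        for (int i = 0; i < n; i++)
            count[(a[i] >> shift) & mask]++;
        int pos = 0;   /* prefix sums turn counts into start positions */
        for (int d = 0; d < (1 << RADIX_BITS); d++) {
            int c = count[d];
            count[d] = pos;
            pos += c;
        }
        for (int i = 0; i < n; i++)
            tmp[count[(a[i] >> shift) & mask]++] = a[i];
        memcpy(a, tmp, (size_t)n * sizeof *a);
    }
}
```

Each pass touches every element once, so the total work is proportional to the number of passes times *n*, matching the Θ(*kn* log(|Σ|)) bound in the proof.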

In this section we describe a parallel implementation of our algorithm. Let *p* be the number of processors available. We first distribute *N/p* reads to each processor. All the processors can execute [STEP-1] in parallel. In [STEP-2] we need to perform parallel sorting on the list $\mathcal{E}$. A parallel radix/bucket sort, which does not use any all-to-all communications, can be employed to accomplish this. For example, the integer sorting algorithm of Kruskal, Rudolph and Snir takes $O\left(\frac{n}{p}\frac{\mathrm{log}n}{\mathrm{log}(n/p)}\right)$ time. This will be *O*(*n/p*) if *n* is a constant-degree polynomial in *p*. In other words, for coarse-grain parallelism the run time is asymptotically optimal, i.e., optimal to within a constant factor. In practice coarse-grain parallelism is what we have. Here *n* = *N*(*r* - *k* + 1). We call this algorithm ParBiConstruct.

**Theorem 2**. *Algorithm *Par-BiConstruct *builds a bi-directed de Bruijn graph in time O*(*n/p*). *The message complexity is O*(*n*).

The algorithm of Jackson and Aluru [9] first identifies the vertices of the bi-directed graph, which they call representative nodes. Then, for every representative node, |Σ| many-to-many messages are generated. These messages correspond to potential bi-directed edges which can be adjacent to that representative node. A bi-directed edge is created only if both the representatives of the generated message exist in some processor; otherwise the edge is dropped. This results in a total of Θ(*n*|Σ|) many-to-many messages. The authors of the same paper demonstrate that communicating many-to-many messages is a major bottleneck and does not scale well. We remark that the algorithm BiConstruct does not involve any many-to-many communications and hence does not have this scaling bottleneck.

The algorithm presented in [9] can potentially generate spurious bi-directed edges. According to the definition [3] of the bi-directed de Bruijn graph in the context of the SA problem, a bi-directed edge between two *k*-mers/vertices exists if and only if there exists some read in which these two *k*-mers are adjacent. We illustrate this with a simple example. Consider a read *r_{i}* =

**Theorem 3**. *There exists an out-of-core algorithm to construct a bi-directed de Bruijn graph using an optimal number of I/Os*.

*Proof*. Replace the radix sort with an external *R*-way merge sort, which takes only $\Theta (\frac{n\mathrm{log}(n/B)}{B\mathrm{log}(M/B)})$ I/Os. Here *M* is the main memory size, *n* is the sum of the lengths of all reads, and *B* is the block size of the disk.

The bi-directed de Bruijn graph constructed in the previous section may contain several linear chains. These chains have to be compacted to save space as well as time. The graph that results after this compaction step is referred to as the *simplified bi-directed graph*. A linear chain of bi-directed edges between nodes *u* and *v* can be compacted only if we can find a valid bi-directed walk connecting *u* and *v*. All the *k*-mers/vertices in a compactable chain can be merged into a single node, and relabelled with the corresponding forward and reverse complementary strings. In Figure 3 we can see that the nodes *X*_{1} and *X*_{3} can be connected with a valid bi-directed walk and hence these nodes are merged into a single node. In practice the compaction of chains plays a very crucial role. It has been reported that merging the linear chains can reduce the number of nodes in the graph by up to 30% [8].

Although the bi-directed chain compaction problem resembles the *list ranking* problem, there are some fundamental differences. Firstly, a bi-directed edge can be traversed in both directions. As a result, applying *pointer jumping* directly on a bi-directed graph can lead to cycles and cannot compact the bi-directed chains correctly. Pointer jumping is an operation on a directed chain/linked list which changes the neighbour of a list node to its neighbour's neighbour. Figure 4 illustrates the first phase of pointer jumping. As we can see, the *green* arcs indicate valid pointer jumps from the starting nodes. However, since the orientation of the node *Y*_{4} is reversed relative to the direction of pointer jumping, a cycle results. In contrast, a valid bi-directed chain compaction would merge all the nodes between *Y*_{1} and *Y*_{5}, since there is a valid bi-directed walk between *Y*_{1} and *Y*_{5}. Secondly, bi-directed chain compaction may change the orientation of some bi-directed edges, and these edges should be recognised and updated accordingly. Consider the bi-directed chain in Figure 3(a). This chain contains two possible bi-directed walks, *Y*_{2} to *Y*_{3} and *X*_{1} to *X*_{3}; see Figure 3(b). The walk from *Y*_{3} to *Y*_{2} (*Y*_{2} to *Y*_{3}) spells out the label *TAGG* (*CCTA*) after compaction. Once we perform this compaction (see Figure 3(c)), the orientation of the edge between *Y*_{3} and *Y*_{4} in the original graph is no longer valid, because the label *CCTA* on the newly compacted node cannot overlap with the label *GGT* on the node *Y*_{4}. However, the label *TAGG* on the newly compacted node does overlap with the label *GGT* on *Y*_{4}, and hence the orientation of this edge should be updated. On the other hand, the edge between *X*_{1} and *Y*_{1} does not need any update after compacting the nodes *X*_{1}, *X*_{2} and *X*_{3}; see Figures 3(b) and 3(c).
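For readers unfamiliar with the primitive, here is a minimal sketch of pointer jumping on an ordinary directed list. This is the operation that, applied naively to bi-directed chains, can cycle as in the *Y*_{4} example; all names here are our own.

```c
#include <assert.h>

#define NIL  (-1)
#define MAXN 64   /* illustrative bound on the list length */

/* One synchronous phase: each node's successor becomes its
 * successor's successor. Reads the old array, writes a new one, so
 * the phase behaves like one parallel step. */
static void jump_phase(const int *next, int *out, int n)
{
    for (int v = 0; v < n; v++) {
        if (next[v] == NIL || next[next[v]] == NIL)
            out[v] = next[v];        /* at (or one hop from) the end */
        else
            out[v] = next[next[v]];  /* jump over the successor */
    }
}

/* Repeat phases until every node points to the last node of its
 * chain; a chain of length L needs O(log L) phases since distances
 * to the end halve each time. */
static void pointer_jump(int *next, int n)
{
    int tmp[MAXN], changed = 1;
    while (changed) {
        changed = 0;
        jump_phase(next, tmp, n);
        for (int v = 0; v < n; v++) {
            if (tmp[v] != next[v])
                changed = 1;
            next[v] = tmp[v];
        }
    }
}
```

On the chain 0 → 1 → 2 → 3, every node ends up pointing at node 3, the end of its chain.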

Since bi-directed chain compaction has a lot of practical importance, efficient and correct algorithms are essential. We now provide algorithms for the bi-directed chain compaction problem. Our key idea is to transform the bi-directed graph into a directed graph and then apply *list ranking*. We define the ListRankingTransform as an operation which replaces every bi-directed edge with a pair of directed edges with appropriate orientations; see Figure 5 for these orientations.

Given a list of candidate canonical bi-directed edges, we apply a ListRankingTransform (see Figure 5) which introduces two new nodes *v*^{+} and *v*^{-} for every bi-directed node *v* and replaces every bi-directed edge with a pair of complementary directed edges.
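Since Figure 5 is not reproduced here, the sketch below encodes one plausible convention for the transform, which we assume rather than take from the paper: node *v* splits into *v*^{+} (even index) and *v*^{-} (odd index), and each bi-directed edge (*u*, *v*, *o*_{1}, *o*_{2}) yields one directed edge plus its mirror between the complementary copies.

```c
#include <assert.h>

/* Orientation of an arrow head: FWD if the k-mer is used in its
 * canonical form, REV otherwise. This encoding is our assumption of
 * the convention in Figure 5, not a quotation of it. */
enum orient { FWD = 0, REV = 1 };

struct dedge { int from, to; };

#define PLUS(v)  (2 * (v))       /* the v+ copy */
#define MINUS(v) (2 * (v) + 1)   /* the v- copy */

/* The copy of v that an edge endpoint uses under orientation o. */
static int side(int v, enum orient o)
{
    return o == FWD ? PLUS(v) : MINUS(v);
}

/* Complementary copy: v+ <-> v- (indices differ in the low bit). */
static int mate(int x)
{
    return x ^ 1;
}

/* Emit the two directed edges for bi-directed edge (u, v, o1, o2):
 * the primary edge and its mirror, which runs between the
 * complementary copies in the opposite direction. */
static void transform_edge(int u, enum orient o1,
                           int v, enum orient o2,
                           struct dedge out[2])
{
    out[0].from = side(u, o1);
    out[0].to   = side(v, o2);
    out[1].from = mate(out[0].to);
    out[1].to   = mate(out[0].from);
}
```

Under this assumed convention the mirror pairs reproduce the symmetric chains seen in the compaction example, e.g. $({Y}_{2}^{-},{Y}_{3}^{+})$ together with $({Y}_{3}^{-},{Y}_{2}^{+})$.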

**Lemma 1**. *Let BG*(*V, E*) *be a bi-directed graph and let BG^{t}*(*V^{t}, E^{t}*) *be the directed graph obtained by applying the *ListRankingTransform* to BG. Then there is a valid bi-directed walk between two nodes u and v in BG if and only if there is a directed path between the corresponding nodes in BG^{t}*.

*Proof*. We first prove the forward direction by induction on the number of nodes in the bi-directed graph. Consider the *basis* of induction when |*V*| = 2; let *v*_{0}, *v*_{1} ∈ *V*. Clearly we are only interested in the case when *v*_{0} and *v*_{1} are connected by a bi-directed edge. By the definition of the ListRankingTransform the lemma in this case is trivially true. Now consider a bi-directed graph with |*V*| = *n* + 1 nodes; if the path between *v_{i}*, *i* < *n* and

Given a bi-directed graph we now give an outline of the algorithm which compacts all the valid bi-directed chains.

• STEP-1: Apply the ListRankingTransform for each bi-directed edge. Let the resultant directed graph be *G*(*V, E*).

• STEP-2: Identify a subset of nodes *V*' = {*v* : *v* ∈ *V*, (*d*_{in}(*v*) > 1 or *d*_{out}(*v*) > 1)} and a subset of edges *E*' = {(*u, v*) : (*u, v*) ∈ *E*, (*u* ∈ *V*' or *v* ∈ *V*')}.

• STEP-3: Apply pointer jumping on the directed graph *G*(*V *- *V*', *E *- *E*').

• STEP-4: Now let $({v}_{i}^{+},\dots ,{v}_{j}^{-})$ be a maximal chain obtained after pointer jumping. Due to the symmetry in the graph there exists a corresponding complementary chain $({v}_{j}^{+},\dots ,{v}_{i}^{-})$. Each chain is replaced with a single node whose label is the concatenation of all the labels in the chain. We should stick to some convention while numbering the newly created node; for example, we can choose min{*v_{i}*, *v_{j}*} as the new number. In our example, if *v_{i}* < *v_{j}* the chain $({v}_{i}^{+},\dots ,{v}_{j}^{-})$ is replaced with the node ${v}_{i}^{+}$.

• STEP-5: Finally, to maintain connectivity we need to update the edges in *E*' to reflect the changes made during compaction. Coming back to our example, we have replaced the chain $({v}_{i}^{+},\dots ,{v}_{j}^{-})$ with the node ${v}_{i}^{+}$. As a result we need to replace any edge $(x,{v}_{j}^{-})\in E\text{'}$ with the edge $(x,{v}_{i}^{+})$. Similarly, we also need to update any edges adjacent to ${v}_{j}^{+}$.

Note that all of the above steps can be accomplished with a constant number of radix sorting operations. Figure 3 illustrates the compaction algorithm on a bi-directed graph. The red nodes in Figure 3 indicate the nodes in the set *V*'; red colored edges indicate the edges in *E*'. After list ranking we have the following four maximal chains: $({Y}_{2}^{-},{Y}_{3}^{+})$, $({Y}_{3}^{-},{Y}_{2}^{+})$, $({X}_{1}^{-},{X}_{2}^{+},{X}_{3}^{-})$ and $({X}_{3}^{+},{X}_{2}^{-},{X}_{1}^{+})$. Now, following the numbering convention described in STEP-4, we renumber the new node corresponding to the chains $({Y}_{2}^{-},{Y}_{3}^{+})$ and $({Y}_{3}^{-},{Y}_{2}^{+})$ as *Y*_{2}. As a result the edges $({Y}_{3}^{+},{Y}_{4}^{-})$ and $({Y}_{4}^{+},{Y}_{3}^{-})$ are updated (shown in green in Figure 3(c)) to $({Y}_{2}^{-},{Y}_{4}^{-})$ and $({Y}_{4}^{+},{Y}_{2}^{+})$. Finally the directed edges are replaced with bi-directed edges in Figure 3(d).

Let $\mathcal{E}_{l}$ be the list of candidate edges for compaction. To do compaction in parallel, we can use a parallel list ranking algorithm [11].

In this section we briefly describe how our algorithms can be used to speed up some of the existing SA programs. As an example, we consider VELVET [8]. VELVET is a suite of programs (velveth and velvetg) which has recently gained acclaim in assembling short reads. The VELVET program builds a simplified bi-directed graph from a set of reads as follows. It puts all the *k*-mers from the input into a hash table and identifies the *k*-mers which are present in at least 2 reads; this information is called the *roadmap* in VELVET's terminology. The program then builds a de Bruijn graph using these *k*-mers. Finally, it takes every read and threads it onto these *k*-mers. The worst case time complexity is *O*(*n* log(*n*)), assuming that the hash table is implemented with a balanced search tree; VELVET actually uses a Splay tree, so this is an amortized rather than a worst case bound. Since VELVET builds this graph entirely in memory, it has serious scalability problems, especially on large scale assembly projects. On the other hand, VELVET has some very good assembly heuristics to remove errors, identify redundant assembly paths, etc. Our out-of-core algorithm can act as a replacement for the code in VELVET that performs in-memory graph construction. The internal de Bruijn graph of VELVET is slightly different from the bi-directed graph our algorithm builds, but it is easy to see the equivalence between the two representations (see Figure 7). We have implemented an out-of-core algorithm that takes a file of reads and the value of *k* and generates the graph for the VELVET program. To be more precise, the VELVET program creates a file named Graph in the working directory when velvetg is run. We have modified the VELVET code by adding an option which quits after it builds the Graph file, without any simplification. This gives us a chance to compare VELVET's algorithm for building the Graph file with our algorithm. The results and more details about the program are in the results section.

Before we go into the discussion we briefly describe our experimental setup. We used an SGI Altix 64-bit, 64-node supercomputer with 2 GB of RAM per node for our parallel experiments. For our sequential out-of-core experiments we used a 32-bit x86 machine with 1 GB of RAM. All our algorithms are implemented in C under Linux.

We have compared the performance of our algorithm with that of Jackson and Aluru [9]. We refer to the latter algorithm as JA. To make this comparison fair, we have implemented the JA algorithm (their implementation is unavailable) on the same platform that our algorithm runs on. We have used the SGI Altix IA-64 machine for all of our experiments. Our implementation uses MPI for communication between the processors. We used a test set of 8 million un-paired reads obtained from sequencing a plant genome at CSHL. The performance of both algorithms is measured for various values of *k* (the de Bruijn graph parameter) on multiple processors. Both algorithms (JA and ParBiConstruct) use the same underlying parallel sorting routines.

Table 1 shows the user and system times for both our algorithm and the JA algorithm. We can clearly see that the system time (communication time) is consistently higher for the JA algorithm. Also notice that as we increase the value of *k*, keeping the number of processors fixed, both algorithms become faster. ParBiConstruct is consistently superior to the JA algorithm for all the parameters. Also notice the speedup (time taken by the JA algorithm divided by the time taken by ParBiConstruct) of ParBiConstruct in Table 1. We obtain a maximum speedup of 24.9*X* in real time when *k* = 21 and *p* = 64. This is because the communication cost of the JA algorithm is very high and it enumerates all the possible overlaps. Finally, over all the settings of *k* and *p*, ParBiConstruct is around 8*X* faster than the JA algorithm, with a maximum speedup of 25*X* when *k* = 21, *p* = 64.

Comparison of runtime between the JA algorithm and our algorithm on an input with 8 million reads from a plant genome

The user time of our algorithm is also consistently superior to that of JA. This is because we do much less local computation. In contrast, JA needs to do a lot of local processing: processing all the received edges, removing redundant ones, and collecting the necessary edges to perform many-to-many communications.

We also compared the memory used by both algorithms. We briefly describe how we obtained the memory reports in our experiments. Since the memory used by each processor varies during execution, we add up the memory used by all the processors at a given instant, divide by the number of processors, and report this number. We obtained the resident memory and shared memory from the top command and averaged them over the number of memory probe samples obtained by top. Table 2 gives details of the memory usage of both algorithms. From these results it is clear that our approach is also memory efficient compared to the JA algorithm: ParBiConstruct uses up to 4*X* less resident memory.

Comparison of memory between the JA algorithm and our algorithm on an input with 8 million reads from a plant genome.

Since JA takes a significant amount of time for inputs larger than 8 million reads, we have compared the two algorithms only for input sizes up to 8 million. The experimental results reported in [9] start with at least 64 processors. We, however, show the scalability of our algorithm for up to 128 million (randomly generated) reads in Table 3, which clearly demonstrates the scalability of our algorithm. We make our implementation and all the details of the test cases available at http://trinity.engr.uconn.edu/~vamsik/ParBiDirected.

Scalability of our algorithm for up to 128 million reads. Since our test dataset contained only 8 million reads, we generated these reads randomly; each read was of size 35 and *k* = 33 was used.

Our aim is to compare the computational efficiency of the current VELVET algorithm for building the de Bruijn graph with that of our algorithm. To accomplish this we have modified the code of VELVET to stop once it completes building the graph from the reads. This is done as follows. Firstly, we run the velveth program to complete the building of the RoadMaps. The code of velvetg is modified such that the program dumps out the Graph file after threading the reads. Our out-of-core algorithm generates the Graph file directly from the reads file and the value of *k*. We have used a low-end 32-bit desktop machine with 1 GB of RAM to demonstrate the scalability of our out-of-core algorithm. Our results indicate that the VELVET algorithm starts *virtual memory thrashing* [12] at around 4 million reads with *k* = 21. This thrashing leads to a massive increase in page faults and stalls the program from progressing further. Thus the VELVET algorithm cannot build large bi-directed graphs. In contrast, our algorithm works with a constant (user specified) amount of memory and scales well to large numbers of reads, as we demonstrate in Table 4.

The program is available at http://trinity.engr.uconn.edu/~vamsik/ex-build-vgraph/.

In this paper we have presented an efficient algorithm to build a bi-directed de Bruijn graph, a fundamental data structure for any sequence assembly program based on an Eulerian approach. Our algorithm is also efficient in parallel and out-of-core settings. These algorithms can be used in building large scale bi-directed de Bruijn graphs. Also, our algorithm does not employ any all-to-all communications in a parallel setting and performs better than that of [9]. Finally, our out-of-core algorithm can build these graphs with a constant amount of RAM, and hence can act as a replacement for the graph construction algorithm employed by VELVET [8].

VK designed and implemented the algorithms, and drafted the manuscript. SR participated in designing the algorithms, revised the manuscript, and coordinated all phases of the project. HD also participated in the design of algorithms and contributed towards the figures in the manuscript. MV and VT introduced the sequence assembly problem to the team and provided short read data to validate our method. All the authors read and approved the final manuscript.

This work has been supported in part by the following grants: NSF 0326155, NSF 0829916 and NIH 1R01LM010101-01A1.

1. Kececioglu JD, Myers EW. Combinatorial algorithms for DNA sequence assembly. Algorithmica. 1995;13(1-2):7–51. doi: 10.1007/BF01188580.
2. Pevzner PA, Tang H, Waterman MS. An Eulerian path approach to DNA fragment assembly. Proceedings of the National Academy of Sciences of the United States of America. 2001;98(17):9748–9753. doi: 10.1073/pnas.171285098.
3. Medvedev P, Georgiou K, Myers G, Brudno M. Computability of models for sequence assembly. Workshop on Algorithms for Bioinformatics (WABI), LNBI-4645. 2007. pp. 289–301.
4. Huang X, Wang J, Aluru S, Yang S, Hillier L. PCAP: A whole-genome assembly program. Genome Research. 2003;13(9):2164–2170. doi: 10.1101/gr.1390403.
5. Myers EW, Sutton GG, Delcher AL, Dew IM, Fasulo DP, Flanigan MJ, Kravitz SA, Mobarry CM, Reinert KHJ, Remington KA, Anson EL, Bolanos RA, Chou H, Jordan CM, Halpern AL, Lonardi S, Beasley EM, Brandon RC, Chen L, Dunn PJ, Lai Z, Liang Y, Nusskern DR, Zhan M, Zhang Q, Zheng X, Rubin GM, Adams MD, Venter JC. A whole-genome assembly of Drosophila. Science. 2000;287(5461):2196–2204. doi: 10.1126/science.287.5461.2196.
6. Batzoglou S, Jaffe DB, Stanley K, Butler J, Gnerre S, Mauceli E, Berger B, Mesirov JP, Lander ES. ARACHNE: A whole-genome shotgun assembler. Genome Research. 2002;12:177–189. doi: 10.1101/gr.208902.
7. PHRAP assembler. http://www.phrap.org/
8. Zerbino DR, Birney E. Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Research. 2008;18(5):821–829. doi: 10.1101/gr.074492.107.
9. Jackson BG, Aluru S. Parallel construction of bidirected string graphs for genome assembly. International Conference on Parallel Processing. 2008. pp. 346–353.
10. Simpson JT, Wong K, Jackman SD, Schein JE, Jones SJM, Birol I. ABySS: A parallel assembler for short read sequence data. Genome Research. 2009;19(6):1117–1123. doi: 10.1101/gr.089532.108.
11. Jaja J. Introduction to Parallel Algorithms. Addison Wesley.
12. Silberschatz A, Galvin PB, Gagne G. Operating System Principles. Wiley.

