The pioneering work in metagenomics analysis was done by Venter et al.
) for the Sargasso Sea environmental genome analysis. The bulk extraction of diverse microbial genomes from the environment without prior laboratory cultivation is one of the most fascinating features of metagenomics. There have been several analyses of various kinds of environmental genomes, such as (2
). Recent progress in next-generation sequencing technology offers more opportunities for metagenome analyses and permits deep sequencing (especially the Illumina Genome Analyzer) for highly diverse microbial populations. However, while a number of metagenomes have been sequenced using next-generation sequencers and deposited into public genome databases, only a few studies have reported their assembly results (3
). This is mainly because of the short length of sequence reads from next-generation sequencers. Furthermore, there is also a fundamental difficulty of metagenomics analysis compared with isolated genome analysis. In a microbial community, the number of strains is initially unknown, and their relative abundance is also unknown and potentially skewed (5
There are currently a few ‘de novo
’ assemblers specifically devoted to metagenome assembly from mixed sequence reads of multiple species (6
). In contrast, there are two alternative approaches to ‘de novo
’ analysis for mixtures of sequence reads from environmental genomes: (i) applying a single-genome assembler to metagenome sequence reads (3
) and (ii) binning (clustering) a set of sequence reads into different clusters (10
). However, single-genome assemblers were not designed to assemble multiple genomes from a mixture of sequence reads with nonuniform sequence coverages. On the other hand, the unsupervised binning of sequence reads also has the limitation of clustering the input reads based only on k
-mer frequencies in the ‘short’ reads without assembly. A third approach (not de novo
) is comparative genome analysis mapping reads or aligning contigs to reference genomes (5
). Unfortunately, the comparative approach cannot cover any microbial species whose reference genomes or closely related genomes have not been assembled.
In spite of such difficulties, our primary goal was ‘to reconstruct the whole genomes of multiple species in a microbial community, particularly from very short sequence reads generated by a next-generation sequencer’. To accomplish such de novo
metagenome assembly, we extended a single-genome assembly program (assembler), named ‘Velvet’, using a de Bruijn graph (15
), to a metagenome assembly program for mixed short reads of multiple species. A de Bruijn graph is a data structure for genome assembly programs that compactly represents an overlap between short reads. The de Bruijn graph-based assembly program identifies the overlaps between reads using a de Bruijn graph and merges the reads to reconstruct longer sequences. Note that the de Bruijn graph representation was first used for sequencing by hybridization by (17
Our fundamental strategy for metagenome assembly was to consider that a de Bruijn graph constructed from mixed sequence reads of multiple species is equivalent to the mixture of multiple de Bruijn subgraphs, each of which is constructed from sequence reads of individual species and to decompose the mixed de Bruijn graph into individual subgraphs and build scaffolds based on each decomposed subgraph ().
The MetaVelvet strategy to decompose a mixed de Bruijn graph.
From the ‘monotonic increasing’ property of de Bruijn graph construction, it is obvious that the de Bruijn graph, denoted d
), constructed from the mixture, denoted X1
, of sequence reads of multiple species contains each de Bruijn graph d
) constructed from sequence reads Xi
of individual species (1
) as subgraphs. Therefore, it clearly holds that
and the strategy of decomposing the de Bruijn graph is proven to be reasonable.
In the decomposition step of the de Bruijn graph of multiple species, the coverage (abundance) difference and graph connectivity are used to distinguish a subgraph composed of a single species from the other subgraphs, where the ‘coverage’ is defined to be k-mer frequency in the input sequence reads. When two subgraphs, say dBg(X1) and dBg(X2), contained in the main de Bruijn graph have distinguishable read coverages, we disconnect the two subgraphs based on the coverage difference, such that metagenome assembly problem would be reduced to a set of easier problems of single-genome assemblies based on the decomposed subgraphs dBg(X1) and dBg(X2). For the graph disconnection task, an algorithm to identify shared nodes was developed, called the ‘chimeric node’, between two subgraphs dBg(X1) and dBg(X2). On the other hand, when two species are sufficiently evolutionarily distant, two genomes cannot share any k-mers, and therefore the main de Bruijn graph constructed from mixed reads of two species must be already separated and consist of two separated subgraphs.
For simulated datasets, the MetaVelvet metagenome assembler succeeded in generating higher N50 scores than any comparable assemblers and produced high-quality scaffolds, as well as separate genome assemblies from isolated species sequence reads, where the N50 score is a standard statistical measure that evaluates the assembly quality and indicates the scaffold length such that 50% of the assembled sequences lie in scaffolds of this size or larger. The scaffolds with longer N50 scores especially benefit the identification of protein-coding genes. In fact, the number of predicted complete protein-coding genes from MetaVelvet scaffolds is significantly larger than that produced by any of the other assemblers. Furthermore, MetaVelvet could reconstruct relatively low-coverage genome sequences as scaffolds. On real datasets of human gut microbial short read data sequenced as part of the MetaHIT project (3
), our MetaVelvet produced longer scaffolds, and significantly increased the number of predicted genes. The source code of MetaVelvet is freely available at http://metavelvet.dna.bio.keio.ac.jp
under the GNU General Public License and is distributed as a bundle with ‘Velvet’.