Metagenomics, also called environmental sequencing, is the study of microbial genomes sampled directly from the environment. We are seeing more metagenomics projects than ever before, due to (i) advances in next-generation sequencing (NGS) technology, such as Roche/454 (Margulies et al., 2005
) and Illumina/Solexa (Bentley, 2006
); (ii) the fact that only a few species can be cultured and studied using conventional microbiological techniques (Schloss and Handelsman, 2005
) and (iii) many studies that have shown the impact of the ‘microbiome’ (i.e. the entire set of genomes in a microbial community) on almost every aspect of life on Earth [e.g. microbes residing in the human body encode far more genes than the genes encoded by the human genome (Gill et al., 2006
)]. Metagenomics has been applied to many studies of natural environments (Venter et al., 2004
; Tyson et al., 2004
) as well as human and animal associated microbiomes (Hess et al., 2011
; Peterson et al., 2009
; Qin et al., 2010
; Turnbaugh et al., 2006
), providing an unprecedented opportunity to gain knowledge about the vast majority of uncultured microbial species.
One of the first steps to analyzing metagenomic sequences is to assemble the reads. For example, reads sequenced from the Acid Mine Drainage dataset yielded two near-complete and three partial genomes (Tyson et al., 2004
). This, however, is a very simple bacterial community; assembling sequences sampled from most whole microbial communities remains a challenging problem (Pop, 2009
). Since most metagenomic sequences are obtained using NGS technology, the traditional ‘overlap-layout-consensus’approach may not be realistic due to the short reads. On the other hand, the de Bruijn graph approach (Compeau et al., 2011
), which breaks the reads into k
-mers and then constructs a de Bruijn graph on these k
-mers, is difficult because of the mixture of genomic sequences from many species and the higher rate of NGS sequencing errors. As a result, it is difficult to assemble complete genomes from metagenomic sequences, even when the community structure is simple—for example, the Acid Mine Drainage dataset (Tyson et al., 2004
) contains merely five species, but yields only two nearly complete genomes.
One of the characteristics of de Bruijn graph-based assemblers is that the resulting graph is usually very tangled, especially when sequencing errors exist. This greatly impedes the formation of long contigs, because the branches cannot be resolved. Moreover, k
-mers from different regions or even from different species may be connected together, which further complicates the structure of the de Bruijn graph. As a result, many short contigs will be reported, which are often insufficient for downstream analysis, such as ab initio
gene prediction in these short contigs (Hoff, 2009
), or homology searches of the contigs (Wommack et al., 2008
). For instance, the MetaHIT consortium only considered contigs of length >500 bp, which represented only 42.7% of the sequencing reads (Qin et al., 2010
Salzberg et al. proposed a gene-boosted assembly approach to improve assembly quality, which used proteins from reference genomes to recruit sequencing reads to fill in the gaps between contigs (2008). Combining this approach with several other strategies, they successfully produced 76 contigs from 8 27 900 33 bp reads obtained from Pseudomonas aeruginosa PAb1, with the largest contig being 512 638 bp. They also demonstrated that most of the genes in a newly sequenced bacterial strain can be assembled using the genome of another strain of the same species as the reference, using gene-boosted assembly. This approach, however, was only applied to single genome assembly problems. Metagenome assembly is more difficult, because of the presence of homologous genes from multiple species in the same community that may behave like repeats for assemblers. Hence, the success of the approach relies on the utilization of a closely related genome (e.g. the genome of the same species but a different strain), which may not be available in metagenomics, which aims to study un-cultured microbial species in natural habitats.
Here, we present GeneStitch to infer gene paths (sequences of contigs), each of which represents a gene or a gene fragment, in the tangled de Bruijn graph resulted from de novo assembly of metagenomic reads, using a network matching algorithm. Given a reference gene sequence, GeneStitch searches for a path in the de Bruijn graph that is most similar to the given reference gene. Assuming that the gene paths found by GeneStitch consist of reads most likely sampled from a real gene, we can assemble genes in a metagenomic dataset by using known homologous genes as references. When prior knowledge of the species composition and gene contents of the sequenced metagenome is unavailable, we can use as many reference gene sequences as possible (e.g. the entire set of genes from all available microbial genomes) to guide the inference of gene paths.
One challenge of inferring gene paths is the separation of very similar genes in a metagenome. The gene paths inferred from GeneStitch may overlap substantially with each other, because homologous genes will share identical regions. Instead of attempting to separate these individual genes (with the risk of introducing misassemblies), we propose to merge these paths into ‘gene graphs’, each of which is a subgraph of the de Bruijn graph that contains reads from the same gene family (homologous genes). We argue that such gene graphs may be considered as single units for downstream analysis of metagenomes, for example for functional predictions by similarity search.
We test our approach on simulated single-genome and metagenomic datasets, and a mock dataset (Morgan et al., 2010
), which consists of real sequencing reads from an artificial community of 10 already-sequenced genomes. The results show that we are able to generate more complete genes by applying GeneStitch, and most of the gene graphs consist only of contigs from homologous genes.