We have evaluated metagenomic assemblies based on the accuracy of the generated contigs using alignment-based similarity to the source genomes, contig length statistics, and the proportions of the source genomes recovered by the contigs. As the k-mer size of de Bruijn graph plays a crucial role in ABYSS’s assembly, we have also tried to assess the impact of k-mer size on metagenomic assemblies by comparing the contigs produced from different runs of the k-mer parameter. We also provide comparisons of metagenomics assemblies to the isolate assemblies of the constituent genomes.
Impact of de Bruijn graph k-mer size on contig length
Figure shows a comparison of the contig lengths to k-mer size across the simLC, simMC and simHC datasets. The average lengths of the contigs decrease with an increase in the complexity of the dataset. From Figure it can be seen that the optimal value of k to obtain longer contigs changes from 29 for simLC (a single dominant genome at a very high coverage) to 21 and 23 for simHC, which does not have any distinctly dominant genomes. As seen from Figure , the simMC dataset k = 25 seems to produce the longer contigs. The clustered results effectively pool the contigs produced from different runs of assembly by varying the k-mer parameter values across the different datasets. Figure shows the distribution of number of bases recovered by the contigs for different cutoff lengths of the contigs and different runs of the assembly. It can be seen from the figure that in general the shorter contigs account for majority of the bases recovered. In all the cases, the number of bases recovered increased with the use of smaller values of k-mer parameter as a greater fraction of the low coverage genomes were being assembled. Table summarizes these statistics regarding the assembled contigs and also provides N50 (weighted median) values.
Figure 1 Contig length distribution The total number of contigs (in log scale) shorter than a given cutoff length, for each of the datasets. (A) simLC. (B) simMC. (C) simHC. (D) Comparison of length distribution of the clustered contigs for metagenomic and isolate (more ...)
Total bases recovered at different contig length cutoffs The total number of bases contained in all the contigs shorter than a certain cutoff length (A) simLC. (B) simMC. (C) simHC.
Contig Alignment Statistics.
Comparison of metagenomic assembly to isolate assembly
As a benchmark for our metagenomic assemblies we separated the reads by their source sequence and performed isolate assemblies. We assembled the reads from each individual sequence separately and combined the final contigs from all the source sequences. We performed the isolate assemblies with different values of k and pooled the results using the clustering algorithm. Figure compares the length distribution of clustered results form metagenomic and isolate assemblies. The simHC dataset produced shorter contigs in both isolate as well as metagenomic assemblies. Amongst the simLC and simHC datasets, the performance of simLC was closer to the isolate assemblies, whereas, the simMC metagenomic assembly was far poorer in comparison to its isolated assembly.
Contig alignment accuracy
Even assemblies of isolate genomes are not completely error free. In the case of metagenomes, the presence of multiple genomes at different coverage depths causes additional problem and the contigs are expected to have more mis-assemblies compared to the contigs from isolate genome assemblies. To assess the accuracy of the assembly we aligned the contigs back to the source genomes. Table reports the results for the different datasets. A threshold accuracy of 95% was used for considering a contig accurate. Details of alignment methods and accuracy calculations are provided in the methods section. The assembly accuracy decreased as the k-mer size was decreased and was worst for all datasets at k=21. Further, the accuracy of the clustered contigs was lowest, due to the accumulation of errors from all the contig sets. This is due to our clustering approach, which tries to retain all the unique sequences. An alternative clustering strategy could be designed that takes the consensus of contigs obtained from different runs. This strategy would improve the accuracy of the results while reducing the total number of bases recovered.
In an ideal case of metagenomic assembly, all the reads forming a contig would come from the same source sequence. In metagenomes the reads from different sources may be co-assembled, resulting in chimeric assemblies. Therefore, to estimate the degree of chimericity, we evaluated the homogeneity of contigs using their read composition. Essentially, greater the number of sources from which a contig is assembled, the higher its entropy will be. The methods section describes the alignment of reads to the assembled contigs and entropy calculations.
Figure shows a plot of contig entropy versus length of the contigs across the datasets of different complexities. The entropy metric is computed at four levels: (i) sequence, (ii) species, (iii) genus and (iv) phylum, derived from the NCBI taxonomy tree. The simHC dataset produces a large number of smaller inhomogeneous contigs due to insufficient coverage of the source sequences. The proportion of inhomogeneous contigs is comparatively lower in the MC and significantly lower in LC datasets. The contigs were more homogeneous at higher phylogenetic levels. Because the genomes which are phylogenetically close together share significant sequence similarity, there is a greater chance of assembling reads belonging to related sequences into the same contig.
Contig entropy at different taxonomic levels The entropy of contigs versus the contig length (in log scale) for the datasets (A) simLC. (B) simMC. (C) simHC.
Coverage of the source sequences
Figure shows a plot of the source coverage ratio for the clustered contig sets of simLC, simMC, and simHC datasets. The positions of the source sequences in Figure correspond to those in the read coverage plot discussed in methods section. Due to space constraints, we did not include the plots showing the coverage values for different k-mer sizes, but we summarize the results here. In almost all the assemblies a high proportion of source genomes sequenced at higher depth was recovered by the contigs. As the value of k-mer size was decreased, more of the genomes sequenced at lower depth were recovered. However, as evident from the contig length distribution plot, Figure , some values of k-mer size tend to be suboptimal length-wise, depending on the complexity of the datasets. Therefore, clustering of the contigs resolves this issue, as the clustered results retain the longer contigs and also the unique contigs representing the low read coverage genomes.
Coverage of source sequences from the clustered contigs.
Escherichia strains co-assembly
Since the collection of DNA sequences for metagenomic experiments does not involve cloning, the reads could come from strains which are highly similar, with very little sequence variation. In this case, even though the effective read coverage of the species is high, due to minor differences in the sequences of the strains, the quality of assembly might not be as good as an isolate genome assembly. To evaluate the performance of co-assembly of reads from related strains, we created the EColiStrains dataset consisting of 10 million reads from 30 strains of Escherichia Coli. For comparing the assembly performance, we created another dataset with the same number of reads from a single strain, E.Coli strain 536 (represents the isolate assembly), and assembled it using the different k-mer size values used for assembly of the strains dataset. For isolate assembly, k=27 and 29 produced the longest contigs. Figure shows a comparison of isolated and metagenomic strains datasets for k=27 and 29. The contigs in metagenomic assembly are considerably shorter than the isolate assembly, suggesting a severe degradation in assembly quality resulting from the presence of multiple strains. Figure shows the source coverage ratio of the constituent strains for different k-mer size values. A relatively high percentage of the source sequences was recovered by the contigs. Table provides some additional assembly statistics for EColiStrains dataset. The EColiStrains dataset exhibited some of the same general trends as the simLC, simMC, and simHC datasets. But, the variations in the total number of bases and contig accuracies were less pronounced.
Contig length distribution of EcoliStrains dataset and isolate assembly. The total number of contigs (in log scale) shorter than a given cutoff length
Coverage of source sequences for EColi Strains
Paired ended assembly
To evaluate the improvement in assembly quality with mate-pairs information we generated and assembled datasets similar to simHC, simLC and simMC with paired-ended reads of insert length 2000 bases. However we observed that only a small fraction (less than 2 %) of the contigs being produced were using mate pairs information. In addition, most of these contigs were assembled with gaps. For our analysis we broke those contigs at the gaps and treated them as separate contigs. Therefore, because to these two factors, we did not observe a significant improvement in the assembly quality of paired ended reads in metagenomic samples.