The field of metagenomics examines the functional and phylogenetic composition of microbial communities in their natural habitats and allows access to the genomic content of the majority of organisms that are not easily cultivatable
[1]. This is achieved through extraction of genomic DNA directly from environmental samples, followed by sequencing, assembly and data analysis. Metagenomics has led to the characterization of microbial communities in a variety of habitats on Earth, for example the ocean
[2]–
[3], soil
[4]–
[5], hot springs
[6] and acid-mine drainage ponds
[7]–
[8]. More recently, the human microbiome, in particular that of the gastrointestinal tract
[9]–
[11], has gained considerable attention, and large-scale metagenomic initiatives now promise to characterize the microbiota at many different body sites, with the ultimate goal of understanding human health and disease (e.g.
[12]). The very first projects used Sanger sequencing, and although it is used less and less since the advent of less expensive next-generation sequencing, it can still reveal novel biological concepts
[11]. In addition, reanalysis of Sanger sequencing data has led to a number of recent discoveries
[13]–
[15]. Yet the two currently most prominent sequencing methods used for metagenomics are pyrosequencing
[16]–
[17] and, most recently, Illumina sequencing
[10], which enable studies of a wide array of ecosystems and have resulted in an exponential increase in environmental sequencing
[18].
The initial steps in metagenomic data analysis involve the assembly of DNA sequence reads into contiguous consensus sequences (contigs), followed by the prediction of genes. The protein-coding genes are then used to predict the functional repertoire encoded in the metagenomes, and the phylogenetic composition can be estimated using a variety of methods
[19]. Data analysis pipelines such as SmashCommunity
[20], MG-RAST
[21], IMG/M
[22] and Metarep
[23] are complemented by numerous special-purpose tools, all of which need to be validated. As no completely annotated metagenome is available, simulations based on genomic data are currently the only feasible way to get close to the truth. Indeed, a number of simulations have already been performed in metagenomics. Mavromatis and colleagues
[24] simulated metagenomic data by sampling sequencing reads from isolate genomes and then benchmarked assembly and annotation tools for Sanger-sequenced metagenomes. In addition, several simulators have been developed that allow users to create metagenomes with desired properties: MetaSim
[25], Grinder
[26] and NGSfy
[27].
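The idea of simulating a metagenome by sampling reads from isolate genomes can be illustrated with a minimal sketch. It is not the procedure of any of the simulators cited above; the function name `sample_reads`, the toy genomes and the abundance values are hypothetical, and the sketch assumes fixed-length, error-free reads drawn with probability proportional to abundance times genome length. Because the source genome and start position of every read are recorded, the true origin of each read is known, which is what makes such data usable as a benchmark.

```python
import random

def sample_reads(genomes, abundances, n_reads, read_len, seed=0):
    """Draw fixed-length, error-free reads from a set of genomes.

    Each read's source genome is chosen with probability proportional to
    abundance * genome length (an assumption of this sketch); the start
    position is uniform along the chosen genome.
    """
    rng = random.Random(seed)
    # Only genomes long enough to yield a full-length read are eligible.
    names = [n for n in genomes if len(genomes[n]) >= read_len]
    weights = [abundances[n] * len(genomes[n]) for n in names]
    reads = []
    for _ in range(n_reads):
        name = rng.choices(names, weights=weights, k=1)[0]
        seq = genomes[name]
        start = rng.randint(0, len(seq) - read_len)  # inclusive bounds
        # Keep the origin so the read can later be checked against the truth.
        reads.append((name, start, seq[start:start + read_len]))
    return reads

# Toy, made-up "genomes" and relative abundances (hypothetical values).
genomes = {"genomeA": "ACGT" * 50, "genomeB": "GGCATTAC" * 40}
abundances = {"genomeA": 3.0, "genomeB": 1.0}

reads = sample_reads(genomes, abundances, n_reads=5, read_len=30)
for name, start, seq in reads:
    print(name, start, seq)
```

A realistic simulator would additionally model read-length distributions, sequencing errors and quality values, which this sketch deliberately omits.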
Here we investigate the fidelity of metagenomic assemblies from next-generation sequencing methods (pyrosequencing and Illumina) and compare them to classical Sanger sequencing as well as to previous results. To enable this, we developed two new metagenomic simulators, iMESS (for Sanger and pyrosequencing) and iMESSi (for Illumina), that not only provide realistic sequencing reads but also simulate errors and corresponding quality values based on actual metagenomic data. The simulated metagenomes were used to benchmark currently used assembly protocols. Given the current rise of Illumina sequencing in metagenomics, we also assessed the impact of quality control and the use of scaffolding in metagenomics. The simulators are freely available to allow the design of custom metagenomic data, and to allow researchers to benchmark new tools on these datasets, the raw and assembled data are available at
http://www.bork.embl.de/~mende/simulated_data/.
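How quality values and sequencing errors relate can be sketched under the standard Phred definition, in which a quality value Q corresponds to an error probability of 10^(-Q/10). The following is only an illustration and not the iMESS or iMESSi implementation: the function names and the example quality values are hypothetical, and only substitution errors are modelled, whereas real platforms (pyrosequencing in particular) also produce insertions and deletions.

```python
import random

def phred_to_error_prob(q):
    """Convert a Phred quality value to an error probability: 10^(-Q/10)."""
    return 10 ** (-q / 10)

def apply_errors(read, quals, seed=0):
    """Introduce substitution errors into a read according to per-base
    Phred quality values (substitutions only; a simplifying assumption)."""
    rng = random.Random(seed)
    out = []
    for base, q in zip(read, quals):
        if rng.random() < phred_to_error_prob(q):
            # Replace the base with a different nucleotide chosen uniformly.
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)

# Hypothetical read with quality values that decay towards the 3' end.
read = "ACGTACGTAC"
quals = [40, 40, 38, 35, 30, 25, 20, 15, 10, 2]
noisy = apply_errors(read, quals)
print(noisy)
```

In this sketch the quality values are supplied by hand; drawing them from profiles observed in actual sequencing runs would be the natural next step.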