Conceived and designed the experiments: ASW DRM SS JR PB. Performed the experiments: ASW DRM SS AIJ MMC MA JR. Analyzed the data: ASW DRM SS. Contributed reagents/materials/analysis tools: ASW DRM SS AIJ MMC MA. Wrote the paper: ASW DRM PB.
Due to the complexity of the protocols and our limited knowledge of the nature of microbial communities, simulating metagenomic sequences plays an important role in testing the performance of existing tools and data-analysis methods on metagenomic data. We developed metagenomic read simulators with platform-specific (Sanger, pyrosequencing, Illumina) base-error models and simulated metagenomes of differing community complexities. We first evaluated the effect of rigorous quality control on Illumina data. Although quality filtering removed a large proportion of the data, it greatly improved the accuracy and contig lengths of the resulting assemblies. We then compared the quality-trimmed Illumina assemblies to those from Sanger and pyrosequencing. For the simple community (10 genomes), all sequencing technologies assembled a similar amount of sequence and accurately represented the expected functional composition. For the more complex community (100 genomes), Illumina produced the best assemblies and most closely resembled the expected functional composition. For the most complex community (400 genomes), there was very little assembly of reads from any sequencing technology. However, due to their longer read length, the Sanger reads still represented the overall functional composition reasonably well. We further examined the effect of scaffolding contigs using paired-end Illumina reads. It dramatically increased contig lengths for the simple community and yielded minor improvements for the more complex communities. Although the increase in contig length was accompanied by increased chimericity, it resulted in more complete genes and a better characterization of the functional repertoire. The metagenomic simulators developed for this research are freely available.
The field of metagenomics examines the functional and phylogenetic composition of microbial communities in their natural habitats and provides access to the genomic content of the majority of organisms that are not easily cultivable. This is achieved through extraction of genomic DNA directly from environmental samples, followed by sequencing, assembly and data analysis. Metagenomics has led to the characterization of microbial communities in a variety of habitats on Earth: for example, the ocean, soil, hot springs and acid-mine drainage ponds. More recently the human microbiome, in particular the gastrointestinal tract, has gained considerable attention, and large-scale metagenomic initiatives now promise to characterize the microbiota in many different body sites with the ultimate goal of understanding human health and disease. The very first projects used Sanger sequencing, and even though Sanger sequencing is used less and less due to the advent of less expensive next-generation sequencing, it can still reveal novel biological concepts. In addition, reanalysis of Sanger sequencing data has led to a number of recent discoveries. Yet the two most prominent sequencing methods currently used for metagenomics are pyrosequencing and, most recently, Illumina sequencing, enabling studies of a wide array of ecosystems, with the consequence of an exponential increase in environmental sequencing.
The initial steps in metagenomic data analysis involve the assembly of DNA sequence reads into contiguous consensus sequences (contigs), followed by prediction of genes. The protein-coding genes are then used to predict the functional repertoire encoded in the metagenomes, and the phylogenetic composition can be estimated using a variety of methods. Data-analysis pipeline tools like SmashCommunity, MG-RAST, IMG/M and Metarep are complemented by numerous special-purpose tools, and they all need to be validated. As there is no completely annotated metagenome available, simulations based on genomic data currently provide the only feasible way to get close to the truth. Indeed, a number of simulations have already been performed in metagenomics. Mavromatis and colleagues simulated metagenomic data by sampling sequencing reads from isolate genomes and then benchmarked assembly and annotation tools for Sanger-sequenced metagenomes. In addition, some simulator software has been developed that allows users to create metagenomes with desired properties: MetaSim, Grinder and NGSfy.
Here we investigate the fidelity of metagenomic assemblies from next-generation sequencing methods (pyrosequencing and Illumina) and compare these to classical Sanger sequencing as well as to previous results. To enable this, we developed two new metagenomic simulators, iMESS (for Sanger and pyrosequencing) and iMESSi (for Illumina), that not only provide realistic sequencing reads but also simulate errors and corresponding quality values based on actual metagenomic data. The simulated metagenomes were used to benchmark currently used assembly protocols. Due to the current rise of Illumina sequencing in metagenomics, we also assessed the impact of quality control as well as the use of scaffolding in metagenomics. The simulators are freely available to allow the design of custom metagenomic data, and to allow researchers to benchmark new tools using these datasets, the raw and assembled data are available at http://www.bork.embl.de/~mende/simulated_data/.
iMESS is a metagenomic simulator for Sanger sequencing as well as pyrosequencing. Users can generate metagenomes through an easy-to-use website at http://www.bork.embl.de/software/iMESS. First, the user specifies the desired community structure by selecting the number of organisms and the shape of the rank-abundance curve. Next, the sequencing method and the amount of sequencing (number of reads) have to be specified. Using these parameters, the simulator calculates how many reads of each organism's genome should be sequenced. iMESS then calls ReadSim (http://ab.inf.uni-tuebingen.de/software/readsim/) to generate reads with sequencing errors but without quality values. Quality-value models were determined by obtaining quality values from actual metagenomic reads and fitting a function. For Sanger sequences, quality models were generated for 3 different datasets: JGI, TIGR and JAP. For pyrosequencing, one model was determined based on reads from a real dataset (unpublished) with a 250 bp median length. Read sequences and quality values are written to a .fasta and a .qual file. For more details please refer to the iMESS manual online.
iMESSi is a metagenomic simulator for the Illumina sequencing technology. It can be downloaded at: http://sourceforge.net/projects/cmessi/. Similar to iMESS, the user first specifies what kind of community should be simulated. The user also has to specify a number of other parameters, including the total number of inserts, the read length, and the insert size and its standard deviation. The actual number of inserts sampled from each genome is calculated in a similar fashion as in MetaSim. To generate realistic quality values, we obtained quality values from the MetaHIT gene catalog dataset and clustered the runs by quality values using Euclidean distances. This resulted in 3 different clusters for 75 bp reads and one for 44 bp reads. To simulate errors within the reads, the quality values are mapped to a ‘read’ extracted from a reference genome, and random errors are then generated with the probability defined by the equation below.
Q = −10 log10(P/(1−P)), where Q is the Phred quality score and P is the error probability.
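As a minimal illustration (a sketch, not the iMESSi implementation itself), the odds-based conversion above and its inverse can be written as:

```python
import math

def error_prob_to_quality(p):
    """Quality score from error probability, per the odds-based equation above."""
    return -10.0 * math.log10(p / (1.0 - p))

def quality_to_error_prob(q):
    """Inverse: rearranging the equation gives P = 1 / (1 + 10^(Q/10))."""
    return 1.0 / (1.0 + 10.0 ** (q / 10.0))

# A quality score of 20 corresponds to an error probability of roughly 1%.
```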
The four error models show large differences in average error. The error models are available at: http://sourceforge.net/projects/cmessi/, but users can easily generate their own error models by extracting the quality values from any Illumina sequencing run and converting them to Sanger-scale Phred scores in .qual format. This enables users to generate realistic data for their sequencing machine and protocol, and enables simulations of Illumina reads of any read length used by Illumina sequencing. The sequences and their assigned quality values are returned as FASTQ-formatted files.
Metagenomic datasets were simulated for Sanger sequencing, pyrosequencing, and Illumina sequencing. For each sequencing technology, three metagenomes were simulated to mimic different community complexities (10, 100 and 400 genomes) (Table 1, Tables S1, S2, S3). We generated metagenomes for the three sequencing platforms at different sequencing depths in order to account for the price difference between the three sequencing technologies and the usual sequencing effort for metagenomic projects using each technology. Thus, about 15 times more base pairs were generated for Illumina than for pyrosequencing, to reflect the lower cost associated with Illumina sequencing, and similarly 1.3 times more for pyrosequencing than for Sanger. The datasets were assembled and analyzed using SmashCommunity (pyrosequencing and Sanger) or, for Illumina, a pipeline of freely available tools (i.e., the FASTX toolkit, SOAPdenovo and parts of the SmashCommunity pipeline).
Sanger and pyrosequencing reads were quality-trimmed using lucy and the lucyTrim.pl script from OCTUPUS (http://octupus.sourceforge.net). Illumina reads were quality-trimmed and filtered using the procedure described previously. Specifically, the 5′-ends of the reads were trimmed so that the abundance of each base (A, C, T, G) per position was within 2 standard deviations of the average across all cycles. Then, all bases with a quality score of less than 20 were trimmed off the 3′-ends. Lastly, all reads that were shorter than 35 bases or had a median quality score below 20 were removed.
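The 3′-end trimming and length/quality filtering steps above can be sketched as follows. This is a simplified illustration, not the published pipeline: the 5′ base-composition trim is omitted, and `min_q`/`min_len` are parameters mirroring the thresholds stated in the text.

```python
from statistics import median

def trim_read(seq, quals, min_q=20, min_len=35):
    """Trim low-quality bases (Q < min_q) off the 3'-end, then discard
    reads shorter than min_len or with a median quality below min_q."""
    end = len(quals)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    seq, quals = seq[:end], quals[:end]
    # Length check first, so median() is never called on an empty list.
    if len(seq) < min_len or median(quals) < min_q:
        return None  # read is filtered out
    return seq, quals
```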
Trimmed and untrimmed reads were mapped to the original genomes with MOSAIK 1.1.0021 (http://bioinformatics.bc.edu/marthlab/Mosaik) using the following parameters: “-a all -m all -hs 15 -mm 2 -act 35”. This software was also used to calculate the coverage of the genomes.
All simulated metagenomic datasets using Sanger and pyrosequencing technologies were assembled using SmashCommunity. The Sanger sequencing data were assembled using Arachne v3.1 with SmashCommunity standard parameters. Pyrosequencing datasets were assembled using the Celera assembler with SmashCommunity standard parameters.
The Illumina datasets were assembled using SOAPdenovo 1.05 with the following parameters: “-K 23 -L 70 -M 3 -u -R -F”. To assess the effect of quality filtering in metagenomic data analysis of Illumina data, we assembled the datasets with and without quality filtering, as described above.
To determine which read was incorporated into which contig, we used the information provided by SmashCommunity for all datasets processed with this tool. To obtain this information for the Illumina datasets, we mapped the reads against the contigs using SOAPaligner 2.20 (parameters: “-r 0 -v 2 -M 2”).
Chimeric contigs are contigs that combine reads originating from more than one genome. This definition was originally based on assemblies of Sanger reads. In contrast to assemblies of Sanger reads, in assemblies of Illumina data reads can be assigned to more than one contig, as an entire read may be identical (or nearly identical) to two reference genomes. Therefore, to adapt the definition of a chimeric contig to Illumina data, contigs were only considered chimeric if they contained uniquely-mapping reads that originate from more than one genome. Uniquely-mapping reads are those that map to only one contig, as opposed to mapping to multiple contigs. For all chimeric contigs we calculated the degree of chimericity as described previously. Specifically, the degree of chimericity is the ratio of the number of reads that do not originate from the species contributing most of the reads in the contig to the total number of reads in that contig.
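Given the genome of origin of each uniquely-mapping read in a contig, this ratio can be computed as follows (an illustrative sketch of the definition, not the pipeline's code):

```python
from collections import Counter

def degree_of_chimericity(read_origins):
    """Fraction of reads in a contig that do not come from the genome
    contributing the majority of its reads (0.0 = non-chimeric)."""
    counts = Counter(read_origins)
    majority_count = counts.most_common(1)[0][1]
    return 1.0 - majority_count / len(read_origins)
```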
In order to determine how accurately contigs represent the corresponding genomes, we defined the Contig Score. To calculate it, we used BLASTN to map contigs to the original genomes. We then extracted the percent identity of the best HSP as well as the percent of each contig covered by its HSP. The Contig Score was then calculated by multiplying these two values and normalizing the result to a range from 0 to 100.
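Under this definition, the score reduces to a simple product of the two BLASTN-derived percentages (a sketch; `pct_identity` and `pct_covered` would be taken from the best HSP of each contig):

```python
def contig_score(pct_identity, pct_covered):
    """Contig Score: best-HSP percent identity multiplied by the percent
    of the contig covered by that HSP, rescaled to the range 0-100."""
    return (pct_identity * pct_covered) / 100.0
```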
Functional annotation and analysis were done using SmashCommunity. Briefly, gene prediction was performed using MetaGeneMark, and the protein translations of the predicted genes were assigned to a COG (Cluster of Orthologous Groups) by performing a BLASTP search against the eggNOG2 database (single best hit, bit score >60). The abundance of each COG in each metagenome was determined using scripts in SmashCommunity. The abundances were normalized to produce probability distributions. To determine the similarity or difference between the COG abundance distributions, Principal Coordinate Analysis (PCoA) was performed using a distance metric related to Jensen-Shannon Divergence (JSD). For a complementary ordination analysis, Principal Component Analysis (PCA) was performed on the COG abundance distribution matrix using R. In addition, the abundance of each COG, as determined from the metagenomic assembly and annotation, was plotted against the abundance of each COG as expected from the input genomes, and Pearson correlation coefficients were determined.
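A JSD-based distance between two normalized COG abundance profiles can be sketched as below. The square root of JSD is a proper metric; this mirrors, but is not guaranteed to match exactly, the distance used by the pipeline.

```python
import math

def jensen_shannon_divergence(p, q):
    """JSD between two probability distributions (base-2 logs, so 0 <= JSD <= 1)."""
    def kl(a, b):
        # Kullback-Leibler divergence, skipping zero-probability terms.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2.0 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def jsd_distance(p, q):
    """Square root of JSD, which satisfies the triangle inequality."""
    return math.sqrt(jensen_shannon_divergence(p, q))
```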
Both metagenomic simulators presented here generate sequences and corresponding quality values that were modeled on actual metagenomic data. Quality values were originally developed for Sanger sequencing to estimate the accuracy of each base call. Most assemblers for Sanger and pyrosequencing use quality values as part of the assembly process (e.g. Phrap (http://www.phrap.org/), Arachne, JAZZ, Celera, Newbler). However, the most commonly used assemblers for Illumina data (SOAPdenovo, Velvet, SSAKE) do not use quality values for assembly. Thus there is no standard treatment for poor-quality bases in Illumina reads. To evaluate the impact of quality-based pre-processing of reads, the 3 Illumina metagenomic datasets were assembled with and without quality control. We chose a quality-control procedure that included trimming the 5′-end based on base frequency distributions and the 3′-end based on quality scores, followed by removal of reads based on median quality scores and minimum length. This is more rigorous than the quality control performed on the first published Illumina metagenomic dataset. Although 13–16% of the reads and 24–27% of the base pairs were removed by quality trimming, the accuracy of the data was greatly improved (Table 2). To assess how well the reads represent the genomic sequences present in the metagenome, the reads were mapped back to the source genomes allowing for a maximum of 2 mismatches. Even though the total number of reads was lower in the trimmed dataset, the total number of reads that mapped to the original genomes doubled. The quality trimming of the reads also produced a strong improvement in the assemblies (Figure 1, Table 2). Notably, for the 10-species metagenome, prior to quality trimming no contigs longer than 500 base pairs (bp) were obtained, while after trimming 13799 contigs longer than 500 bp were assembled with an average length of 2332 bp.
For the 100-genomes metagenome, the number of contigs longer than 500 bp increased almost 3-fold and the N50 more than doubled. For the 400-genomes metagenome, improvements to the assembly due to quality filtering were more modest. Increasing the number of contigs longer than 500 base pairs strengthens the ability to predict and annotate protein-coding genes using both homology- and neighborhood-based methods. Assembly of the trimmed dataset clearly outperforms assembly of the untrimmed dataset, demonstrating that stringent quality control as performed here should be applied to real metagenomic sequencing data in order to enhance results.
We used different assembly programs for reads generated by each sequencing technology in order to account for the differences between them. Therefore, the following is a comparison between assemblies produced from different sequencing methods, along with a chosen pipeline of assembly software with specified parameters. We used parameters that are optimized for metagenomic assembly as well as for each technology and that were used in previous studies.
A comparison of the assemblies shows that contig length distributions differ depending on the community complexity and the sequencing technology (Figure 2, Table 3). For simple communities (10 genomes), all sequencing platforms produced a similar total sum of contig lengths, but differed in the distribution of contig lengths. Although pyrosequencing had the longest N50 (N50 is defined as the length N for which 50% of all bases are contained in contigs of length ≥N), Sanger sequencing produced the largest number of contigs greater than 500 bp. For more complex communities (100 genomes), Illumina reads resulted in by far the best assembly, with 8 times as many contigs assembled as for the Sanger sequences (the next best), the largest proportion of long contigs, and over 6 times more genes with functional annotations. For the most complex community (400 genomes) there was very little assembly using any technology. Although there was no assembly of Sanger reads into contigs, the Sanger reads still represent the best ‘assembly’, with 10 times more fragments over 500 bp than the Illumina assembly and over 10 times more genes with functional annotation. These differences can be attributed to the differences in read length and especially sequencing depth, but both parameters are intrinsic to the different sequencing technologies, as the cost of the sequencing technology is directly related to the sequencing depth. All of the sequencing technologies perform comparably when the coverage per organism is relatively high, as in the 10-genomes metagenome. For the more complex communities (100 genomes), however, Illumina performs better due to the greater sequencing depth achieved. For the metagenomes with 400 genomes, even the sequencing depth achieved with Illumina does not make up for the lower coverage of each genome, and Sanger performs well due to the long read lengths.
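For reference, the N50 statistic quoted throughout can be computed from a list of contig lengths as follows (a standard sketch consistent with the definition above):

```python
def n50(contig_lengths):
    """Smallest length N such that contigs of length >= N together
    contain at least half of all assembled bases."""
    half_total = sum(contig_lengths) / 2.0
    running = 0
    # Walk contigs from longest to shortest, accumulating bases.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= half_total:
            return length
    return 0  # empty assembly
```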
The definition of a chimeric contig arose from analysis of Sanger assemblies and describes a contig that combines reads originating from more than one genome. In this case, reads that originate from two different genomes may be combined into one contig based on a short region of homology between the two, while the majority of the contig would match one of the reference genomes better than the other. However, for Illumina reads the definition of a chimeric contig is not as clear. As Illumina reads are so short, an entire read may be identical (or nearly identical) to two reference genomes. In addition, most Illumina assembly software allows reads to be assigned to more than one contig. Thus, for Illumina data we defined a contig as chimeric if it contains uniquely-mapped reads that originate from more than one genome, where uniquely-mapped reads are those that map to only one contig, as opposed to mapping to multiple contigs. Moreover, in order to assess the accuracy of a contig without using the concept of chimericity, which may not be as informative for Illumina data, we defined the term ‘contig score’ to represent the sequence identity between a contig and its corresponding genome. The contig score can vary between 0 and 100, with 100 being the best value.
The percentage of chimeric contigs was lowest for Sanger sequencing, while pyrosequencing and Illumina had a much higher percentage of chimeric contigs (Table 2, Figure 2). However, the degree of chimericity (percentage of reads from “wrong” genomes) was on average higher for Sanger sequencing than for the other sequencing technologies, with the effect being more pronounced in the least complex community and almost negligible in the most complex community. The degree of chimericity was clearly dependent on contig length for all sequencing methods, with shorter contigs having a much higher degree of chimericity. The percentage of chimeric contigs and the degree of chimericity both increase with increasing community complexity. This shows that some contig accuracy characteristics known from Sanger sequencing also hold true for next-generation sequencing methods.
Comparing the contig scores to the chimericity analysis makes it clear that chimeric contigs can still resemble the original genomes (Figure 2, Panel B). This is especially true for Illumina contigs, which had overall better contig scores than the other sequencing methods for all community complexities. Conversely, pyrosequencing produced assemblies with the lowest contig scores. In agreement with the relationship between contig length and chimericity, we also found a proportional relationship between contig length and contig score in most cases, delivering additional evidence that longer contigs are indeed more reliable. However, for the 10-genome community the trend was not as clear for the contigs from pyrosequencing, and a number of long contigs had a low contig score.
The assembly of reads into contigs usually does not lead to completely assembled genomes; hence scaffolding is used to combine contigs and place them within the context of their genomic location. Scaffolds are constructed by linking contigs using information from paired-end reads. During this process a number of unknown bases, or gaps, usually remain between the sequences of the linked contigs. Some scaffolding tools try to fill these gapped regions with unused reads. Unknown bases that remain between the contigs in the scaffold are represented by Ns. To use the information obtained by scaffolding, scaftigs can be constructed by extracting the contiguous sequences that lack unknown bases (Ns). Scaffolding is especially useful when assembling short reads generated using next-generation sequencing technologies, since repeats are harder to resolve in this case. The main advantage of scaftigs over contigs is an increase in fragment lengths, and scaftigs have proven useful in metagenomic data analysis.
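Extracting scaftigs from a scaffold then amounts to splitting its sequence at runs of Ns, which can be sketched as below (`min_len` is an illustrative filter parameter, not one taken from the study):

```python
import re

def scaftigs(scaffold_seq, min_len=1):
    """Split a scaffold at runs of unknown bases (Ns) into scaftigs,
    keeping only pieces of at least min_len bases."""
    pieces = re.split(r"N+", scaffold_seq.upper())
    return [p for p in pieces if len(p) >= min_len]
```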
We used simulated Illumina data to survey the effect of scaffolding on assemblies of different communities. Fragment lengths increase with scaffolding. This is most pronounced in the low-complexity metagenome (10 genomes), where the N50 increases over 10-fold from 3240 to 35893, while there is hardly a difference in the high-complexity (400 genomes) dataset (the N50 improves from 631 to 690) (Table 4, Figure 3). Although the use of scaffolds increases contig lengths, a larger proportion of scaftigs than contigs were chimeric in simulations of all community complexities. The degree of chimericity was also higher for scaftigs than for contigs in all length bins (Figure 3). While the effect of scaffolding seems detrimental when looking at chimericity, the effect is small when comparing the actual sequence of scaftigs and contigs to the original genomes. The percentage of sequences having a contig score below 95 slightly increases for all 3 community complexities; however, the median contig score was very similar for scaftigs and contigs (Table 4). In conclusion, scaffolding of Illumina data represents a tradeoff between increased fragment lengths and accuracy, and therefore might be more useful when mapping fragments for function-assignment purposes, but less so when sequence identity is used to quantify the distance to reference genomes.
To determine whether better assembly parameters, such as longer contigs, result in assemblies that more accurately represent the functional composition of the community, we compared the functional content of the metagenomes to the functional repertoire expected from the input genomes. For the 10- and 100-genome metagenomes, where the genomes had higher coverage, the actual COG abundances determined from the assemblies correlated well with the expected COG abundances (Figure 4A). For the more complex communities (100 and 400 genomes), the Illumina scaftigs had slightly better correlations than the Illumina contigs. This indicates that the positive impact of increased fragment length outweighs the minor increase in the number of assembly errors. In addition, we performed principal component analysis and principal coordinate analysis of JSD distances to determine how the use of different technologies might affect functional ordination analyses. Both ordination analyses showed that for all sequencing technologies the low-complexity metagenomes (10 genomes) were similar in functional content to each other as well as to the expected content (Figure 4B). For the medium-complexity metagenomes (100 genomes), the Illumina reads produced assemblies that were very similar to expected, with the Sanger assembly also being close and the pyrosequencing assembly being the most different from expected. For the high-complexity community (400 genomes), the Illumina and pyrosequencing metagenomes appear quite divergent from the expected functional content, while the Sanger reads provided the best representation of the functional content, with the Sanger assembly appearing relatively similar to expected in the PCoA. One of the main reasons that pyrosequencing could not accurately represent the overall functional composition was the lower number of genes that were annotated (Table 3). In addition, for all metagenomes the Illumina contigs and Illumina scaftigs had very similar functional compositions.
Overall, the functional analysis shows that better assemblies (e.g., more complete genes) do indeed result in better functional characterization of a metagenome.
The first step in any metagenomic data analysis should be raw-data treatment. This includes quality control and removal of contamination (e.g., human contamination in metagenomic studies of the human gut). Tools for NGS quality control like the FASTX toolkit, SolexaQA or PrinSeq are readily available. However, there is currently no standard protocol for how the quality values should be used in read pre-processing. Our results show the importance of good quality control, as the Illumina assemblies greatly improved after rigorous quality filtering and trimming.
After this initial step there are a number of analyses that one can perform, but to decrease the complexity of the data, the quality-controlled reads are usually assembled into contigs. Our comparison of assemblies from different sequencing technologies reveals that each assembly has different characteristics depending on community composition and sequencing technology. The different sequencing technologies performed similarly for the low-complexity community. For the more complex community (100 genomes), the Illumina sequences assembled into a much greater total length of contigs, resulting in more complete genes. This is due to the greater sequencing depth achieved, as the price per base is much lower for Illumina sequencing than for pyrosequencing or Sanger. For the most complex community, none of the sequencing technologies assembled many reads into contigs. However, due to their length, the Sanger reads still yielded many sequences greater than 500 bp. Earlier metagenomic simulation studies focused on the chimericity of contigs, concluding that currently used assemblers need to improve to be useful for metagenomics. However, this term, which originated from Sanger sequencing, may not be as applicable to Illumina data, because Illumina reads are so short that they may actually represent regions that are identical between two organisms, and because assembly of Illumina reads often results in reads being assigned to more than one contig. We therefore need to assess new ways to determine the accuracy of Illumina assemblies. Accordingly, we defined the term ‘Contig Score’ to quantify the divergence of the contigs from the original genomes. Our results show that for all sequencing technologies and community complexities, the vast majority of the contigs diverge by less than 5% from the original genomes. The Illumina dataset excelled by this measure, showing the usefulness of Illumina sequencing data in metagenomics.
The reliability of most contigs is reflected by the fact that the functional repertoire of the low- and medium-complexity metagenomes accurately represents the expected functional repertoire. This is also because the amount of sequence produced allowed all of the sequencing technologies to provide sufficient coverage of each genome. For the most complex community, where there was low coverage of each genome, assemblies from Illumina and pyrosequencing failed to represent the expected functional composition of the metagenomes, as very few complete genes were annotated. However, the Sanger reads approximated the expected functional composition reasonably well, as the length of the reads allowed for accurate functional annotation. For the pyrosequencing simulations, 250 bp was used as the average read length. Currently, the GS FLX Titanium system can deliver reads as long as 400 bp, which would probably improve the assemblies. The ability of all NGS sequencing technologies to fully capture the functional repertoire of complex communities will also improve as technological developments lower prices and allow for deeper sequencing.
By using the paired-end information available in most NGS technologies, the fragment length of an assembly can be increased through gap-filling and scaffolding of contigs. Our results show that scaffolding is a good way to increase fragment lengths, and although scaffolding increased chimeras and decreased the Contig Score, the functional profiles of the metagenomes derived from contigs and those derived from scaftigs were virtually indistinguishable, with the COG abundance profiles of the scaftig metagenomes correlating slightly better with the expected.
Currently, Illumina sequencing technology can produce the greatest yield at the lowest price, but it has not yet been extensively used for metagenomics. Our study of simulated metagenomes shows that Illumina data can be used to obtain assemblies that, for the low- and medium-complexity metagenomes in this study, are superior to those from pyrosequencing and Sanger sequencing, provided rigorous quality control of reads is performed prior to assembly. However, assembly performance is coupled to the underlying community structure, and thus simulations will aid in choosing the optimal sequencing technology for a microbiome of interest. In addition, the results are highly dependent on sequencing depth and read lengths, both of which are increasing for next-generation sequencing technologies; thus these technologies may perform even better for more complex metagenomes in the future.
Genomes Used in the Low Complexity Metagenome and Estimated Coverage (10 genomes).
Genomes Used in the Medium Complexity Metagenome and Estimated Coverage (100 genomes).
Genomes Used in the High Complexity Metagenome and Estimated Coverage (400 genomes).
We thank Y. Yuan and the EMBL IT core facility for managing the high-performance computing resources. We are also grateful to all members of the Bork group for discussions and assistance.
Competing Interests: The authors have declared that no competing interests exist.
Funding: This research received funding from MetaHIT, the Natural Sciences and Engineering Research Council of Canada (funding A.S.W.) and European Molecular Biology Laboratory. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.