Microorganisms are vital components of our planet's ecosystems. Over the last two decades, PCR amplification and sequencing of 16S ribosomal RNA (16S rRNA) genes directly from environmental samples have revealed an astonishing amount of new microbial diversity
[1],
[2]. However, because the ‘universal’ primers used in PCR are designed from already known groups of organisms, the resulting picture of community composition is likely to be skewed, especially for environmental samples containing divergent microbial lineages
[3].
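To make the notion of primer bias concrete, the following minimal sketch counts mismatches between a degenerate primer and a candidate binding site on a 16S rRNA gene. It is an illustration only and not part of the method described in this study: the function names, the widely used 27F primer sequence and both template fragments are assumptions chosen for the example, and real primer evaluation would also need to consider the reverse primer, mismatch position and annealing conditions.

```python
# Illustrative sketch: count primer/template mismatches under IUPAC degeneracy.
# All names and sequences here are hypothetical examples, not data from the study.

# IUPAC nucleotide codes mapped to the bases they represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}


def count_mismatches(primer, site):
    """Number of positions where the site base is not allowed by the primer code."""
    if len(primer) != len(site):
        raise ValueError("primer and site must have the same length")
    return sum(
        1 for p, s in zip(primer.upper(), site.upper()) if s not in IUPAC[p]
    )


def best_binding_site(primer, template):
    """Slide the primer along the template (same strand, same orientation).

    Returns (mismatches, offset) for the window with the fewest mismatches;
    ties are resolved in favour of the smallest offset.
    """
    n = len(primer)
    if len(template) < n:
        raise ValueError("template is shorter than the primer")
    return min(
        (count_mismatches(primer, template[i:i + n]), i)
        for i in range(len(template) - n + 1)
    )


# Commonly used bacterial forward primer 27F (M = A or C).
PRIMER_27F = "AGAGTTTGATCMTGGCTCAG"

# Hypothetical template fragments: one matching the primer, one divergent.
KNOWN_LINEAGE = "TTAGAGTTTGATCCTGGCTCAGGA"
DIVERGENT_LINEAGE = "TTAGAGTTTGATCGTGGCTTAGGA"

if __name__ == "__main__":
    for name, seq in [("known", KNOWN_LINEAGE), ("divergent", DIVERGENT_LINEAGE)]:
        mismatches, offset = best_binding_site(PRIMER_27F, seq)
        print(f"{name}: {mismatches} mismatch(es) at offset {offset}")
```

In this toy case the divergent fragment differs from the primer at two positions, which is the kind of difference that can reduce or prevent amplification of a lineage and thereby distort PCR-based community profiles.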
Direct sequencing of total environmental DNA (metagenomics) has the potential to assess the true diversity of the environment without primer bias
[4],
[5]. Metagenomic sequences can be assigned to taxa by comparing them with reference genomes, based on either sequence similarity
[6]–
[9] or genomic composition
[10]–
[13]. However, these types of assignments are only informative when the genomes of closely related taxa are present in the reference set. As reference genomes are only available for a limited part of the phylogenetic tree of life
[14], these taxonomic predictions are generally of low resolution (e.g. phylum or order) and hence often give only an unsatisfactory description of community composition.
In contrast, several comprehensive databases exist for the 16S rRNA gene that provide detailed phylogenetic trees
[15] and allow for taxonomic resolution down to the species level
[16]. Shotgun metagenomic datasets obviously also contain fragmented 16S rRNA genes and these have been directly assigned to taxa through BLAST-based comparisons
[4] or phylogenetic distance-based clustering
[17]. However, because metagenomic reads are short and randomly sampled, they may not cover the phylogenetically most informative regions of the 16S rRNA gene, which reduces the efficiency of taxonomic assignment. Sequence assembly can potentially increase the length of the 16S rRNA gene sequences recovered
[18], but low sequence coverage may limit assembly success for 16S rRNA genes and low-stringency assemblies may result in chimeric sequences
[19],
[20]. The recently released EMIRGE software uses iterative mapping of short Illumina reads against reference sequences to reconstruct 16S rRNA genes
[19]. Although this approach can resolve sequences down to single-nucleotide differences, its ability to avoid chimeras depends strongly on the quality of the reference database. Further, EMIRGE's algorithm is currently not designed for pyrosequencing reads, which contain high rates of insertion and deletion errors (e.g. in homopolymers)
[21]. There is thus a need for an approach that reconstructs 16S rRNA genes with high accuracy from pyrosequencing data.
In the present study, we describe a strategy to reconstruct nearly full-length 16S rRNA sequences from metagenomic pyrosequencing data. Through simulation of communities with different diversities, we developed a process of stringent assembly and data filtering that generates 16S rRNA contigs with minimal chimera rates. We then applied our process to assess the microbial symbiont communities of two marine sponge species and compared the outcome to PCR-based assessments of community structure (pyro-tag sequencing). We show that about 30% of the abundant phylotypes reconstructed from metagenomic reads failed to be amplified by PCR, most likely because of primer mismatches.