When characterizing microbial communities, a critical goal for metagenomic data analysis is recovery of a collection of full-length SSU sequences, each of which represents an operational taxonomic unit. However, when short read metagenomic datasets sampling coexisting organisms are assembled, the SSU genes tend to be highly fragmented and misassembled (Figure ). The resulting short contigs are often composite sequences, not representative of any individual taxon present. Complexity arises because sequences in highly homologous regions co-assemble, while assembly paths diverge where sequence variation exceeds some defined threshold. Identification of the appropriate path is confounded when reads are shorter than the distance between low variation regions. EMIRGE solves these problems by avoiding traditional assembly altogether, probabilistically reconstructing full-length SSU gene sequences from metagenomic datasets. To our knowledge, this is the first report of successful full-length SSU reconstruction from short read metagenomic sequencing data. The method also accurately estimates relative abundances of SSU sequences from each organism type (Figures and ). Of course, like all approaches relying on the SSU rRNA gene for quantification, gene copy number can confound abundance estimates [33
Full-length genes provide more complete taxonomic information than prior tag-sequencing approaches that have sequenced PCR-amplified short hypervariable regions (typically < 200 bp) [3
]. Tag sequencing is subject to potential primer bias [16
] and ultimately relies on underlying phylogenies built from full-length genes. Although the phylogenetic information contained in hypervariable region tags is generally concordant with information contained in full-length sequences at higher taxonomic levels [3
], careful screening of read quality is necessary to avoid overestimating community diversity [34
]. Others have argued that full-length Sanger sequencing of traditional clone libraries remains the only way to adequately construct the phylogeny of life [36
]. The method presented here offers a cost-effective alternative for reconstructing accurate full-length sequences. The increased read depth underlying each reconstructed sequence ensures that no single read or its potential sequencing errors inflate diversity estimates.
One key to the success of EMIRGE lies in the iterative approach encapsulated in the EM algorithm (Figure ). The deep coverage of Illumina sequencing has been used before to iteratively map and correct whole genome consensus sequences from a single species [37
], or a population of closely related strains [38
]. Our approach differs in both the end goal and the statistical approach taken: the EM algorithm models a true population of SSU genes, and constructs only probabilistic descriptions of both the SSU sequences and their underlying abundances. Reference sequences that show evidence (in the form of multiple ambiguous base probabilities above some threshold) of multiple strains are split and allowed to evolve separately, rather than forcing reads into a single composite sequence. Even when iteration ends, base probabilities in the reconstructed SSU sequence can reveal likely single nucleotide polymorphisms in closely related but distinct subspecies in the community (Figure ).
Also central to our approach is the handling of uncertainty by the EM algorithm. This algorithm has a wide variety of applications in high-throughput biological experimentation, which often must deal with hidden data [39
]. Here, rather than try to make a definitive statement about every read, ambiguity created by short reads and high homology within the SSU gene is dealt with probabilistically. Thus, evidence for the sequence and abundance of a particular SSU gene also accumulates probabilistically, with more evidence accumulating from more probable read mappings in each iteration. The result is a set of SSU sequences in which each reconstructed nucleotide has a confidence estimate based on its final probability. The approach was validated by recovery of the anticipated set of sequences from both simulated and natural community datasets (Figures and ) at a level of taxonomic resolution typically used to define operational taxonomic units (OTUs; 97% identity). The algorithm can be tuned to higher levels of stringency (for example, 99%) if desired, an important feature given the diversity of genomes and metabolisms for organisms with similar or even identical SSU sequences.
The benefit of the probabilistic EM strategy is demonstrated by the accuracy of the SSU reconstructions obtained, even when 10% of nucleotide positions were mutated in the underlying SSU database. This robustness of the method to database error means that new taxa can be discovered. For example, we were able to recover a novel Sulfobacillus SSU gene not identified in previous metagenomic and PCR-based analyses of similar biofilm communities. This gene shared only 88% identity with the closest sequence in the starting reference database. We have also applied the EMIRGE method to the description of thermophilic bacterial consortia adapted to grow on switchgrass (JM Gladden et al.: Community dynamics and glycoside hydrolase activities of thermophilic bacterial consortia adapted to switchgrass, submitted). The method recovered full-length SSU genes that corresponded closely to phylogenetic identifications derived from amplicon-based pyrotag sequencing for these communities, and the EMIRGE prior probabilities showed general concordance with the abundance estimates made by pyrotag sequencing.
Our implementation runs overnight on standard hardware available to most labs studying microbial ecology (see Materials and methods). However, the method makes assumptions and choices for computational speed that could be improved upon. In its current form, for example, we have chosen a read mapper [19
] that, while extremely fast, is blind to insertions and deletions (indels). We find that the main effect of this choice is the occasional presence of small indel errors in the reconstructed sequence. In practice, these rare indels have little effect on taxonomic assignment. Future EMIRGE implementations will incorporate a method to handle indels. This will allow for extension of EMIRGE to genes with higher levels of sequence divergence and make it useful for reconstruction of full-length SSU genes from metagenomes sequenced with Roche 454 technology (which is prone to indel errors at homopolymer runs [34
Additionally, although the method adequately corrects nucleotide errors in the reference database, it is possible that chimeric database sequences could carry over into SSU reconstructions, if reads map across the full length of the chimera. None of the EMIRGE-generated sequences reported here were identified as chimeric (data not shown); however, we have documented at least one very low abundance chimera (below the reporting threshold) in the natural community that evolved from a chimeric database sequence. Strict database pre-screening should eliminate this potential problem. EMIRGE-generated sequences would benefit from the same downstream quality control applied to traditional clone libraries.
As it is described here, EMIRGE does not suffer from potential primer bias introduced by so-called 'universal' primers [16
], allowing for discovery of novel species that may not have canonical primer binding sites [18
]. Like other methods that use next generation sequencing technologies [3
], the method also removes potential cloning bias introduced with Sanger sequencing. However, it may be subjected to other biases associated with new technologies - for example, the under-representation of sequences at the extremes of GC content [40
In its current form, the analysis relies upon only the small fraction (< 0.2% here) of reads that derive from SSU genes. However, the EMIRGE algorithm could easily be applied to full-length SSU amplicon datasets, enabling confident reconstruction of full-length SSU genes from extremely low-abundance organisms. We focused here on demonstrating the accuracy of the method for reconstruction of SSU sequences for OTUs representing ≥ 1% of the population. If the method scales linearly, we can expect to recover accurate sequences and abundances from SSU genes from organisms representing just 0.002% of similar populations with a single lane of Illumina amplicon sequencing.