|Home | About | Journals | Submit | Contact Us | Français|
Identifying bacterial strains in metagenome and microbiome samples using computational analyses of short-read sequence remains a difficult problem. Here, we present an analysis of a human gut microbiome using on Tru-seq synthetic long reads combined with new computational tools for metagenomic long-read assembly, variant-calling and haplotyping (Nanoscope and Lens). Our analysis identifies 178 bacterial species of which 51 were not found using short sequence reads alone. We recover bacterial contigs that comprise multiple operons, including 22 contigs of >1Mbp. Extensive intraspecies variation among microbial strains in the form of haplotypes that span up to hundreds of Kbp can be observed using our approach. Our method incorporates synthetic long-read sequencing technology with standard shotgun approaches to move towards rapid, precise and comprehensive analyses of metagenome and microbiome samples.
As yet, only a small fraction of the microbial world has been isolated and studied in the laboratory and little is known about species that cannot be cultured1. Metagenomics has begun to shed light on this unculturable ‘microbial dark matter’ by sequencing the DNA of microbial communities directly from the environment2. Metagenomic analyses of soil3, water4 and human microbiome5 samples have already increased our understanding of the microorganisms present in these environments.
Although short-read sequencing technologies have enabled high-throughput metagenomic studies, the limited read lengths combined with the complexity of microbial samples (hundreds to thousands of species) can make it difficult to accurately identify bacterial strains, recover whole genomes, catalog sample diversity and assess the abundance of species and strains.
Many approaches have been proposed to overcome these problems. Binning de novo assembled contigs using metrics such as tetranucleotide frequencies6, genomic coverage under different DNA extraction methods7 or abundance correlations between samples from multiple individuals8 have all been used to identify species and recover their genomes. However, each approach has limitations: tetranucleotide frequencies may vary within the same species6, closely related bacteria have similar DNA extraction efficiencies7, and establishing abundance correlations among individuals may require a large number of samples8. Metagenomic contigs can also be scaffolded using chromatin-level contact probability maps generated by the high-throughput chromosome conformation capture (Hi-C) technology9; however, Hi-C has high input-DNA requirements and the performance of this scaffolding method has not yet been assessed on bona fide high-complexity metagenomes. An alternative technology, single-molecule real-time sequencing10, has been used to sequence 16S rRNA amplicons11, but has seen limited application to whole metagenomes.
Recently, Tru-seq synthetic long-read sequencing has been developed to increase the effective read length available using the Illumina platform from hundreds to more than ten thousand base pairs. This method uses a modified library-preparation protocol, in which kilobase-long DNA fragments are extracted, diluted, amplified, and reassembled from regular short reads12. Synthetic long-read sequencing has been used in metagenomics for assembly validation13 and for studying environmental metagenomes14. Here, we report the first study of the human gut microbiome using synthetic long reads.
We present three improvements compared with previous technologies. First, we demonstrate that long reads, in conjunction with Lens, an algorithm developed for this study, reveal extensive haplotype diversity among individual bacterial strains of the same species. This level of resolution was inaccessible using previous technologies, which at best studied microorganisms at the strain level. Second, we use long reads to assemble de-novo hundreds of megabases of genomic sequence in contigs of complete operons and whole bacterial chromosomes. Finally, we determine microbial composition accurately.
We applied our long read sequencing approach to two metagenomic datasets: the human microbiome project staggered mock metagenomic community5 (mock metagenome), and a sample from the gut of a healthy male adult individual (human gut metagenome). The mock metagenome is a synthetic community of 20 organisms with known reference genomes (Supplementary Table 1) that is widely used for validation. We generated three Tru-seq synthetic long read libraries (2.9 Gbp of sequence, N50 read length of 9.2 kbp; Supplementary Figure 1) for this dataset, in addition to 3.1 Gbp of standard 101-bp paired-end Illumina short read libraries (Supplementary Table 2). For the human gut metagenome, we generated seven Tru-seq synthetic long read libraries (8.3 Gbp of sequence, N50 read length of 8.6 kbp; Supplementary Figure 2), and complemented this data with 8.1 Gbp of standard Illumina libraries (Supplementary Table 3). The similar sequencing amounts of short and long reads for both samples helped assess the benefits of using longer read lengths.
We maped the long reads to the known reference genomes of the mock community using the MUMmer aligner. Accuracy was high, with less than 0.5% of reads misassembled (Supplementary Table 4). However, we observed differences in coverage between short- and long-read technologies, with long reads covering ~ 15% fewer base pairs overall than short reads (Supplementary Tables 5, 6; Supplementary Figure 3). In particular, four organisms were highly covered (>98%) by short reads but long read coverage was substantially less thorough (<75%), suggesting that long reads have more sequence bias. Interestingly, six organisms had at least 10% of their genomes covered by long reads but not by short reads, suggesting that the two technologies can be complementary (Supplementary Table 7). We also found that abundance estimates were substantially different between the technologies (sometimes by more than an order of magnitude), indicating that long reads on their own may be insufficient for abundance estimation (Supplementary Figure 4). Differences in coverage have been previously linked to the long reads’ increased sensitivity to GC content during the PCR amplification step12,19. We suggest that for best results, both types of data should be used.
We assembled long reads from the human gut metagenome sample using Nanoscope, a new bioinformatics pipeline we created. Nanoscope automates metagenomic assembly, species identification, substrain analysis, and abundance estimation from a combination of short and long read data (Figure 1; Supplementary Code). It is available online as a free open-source tool (https://github.com/kuleshov/nanoscope).
Nanoscope starts by invoking the Soapdenovo20 and Celera21 assemblers to independently assemble the short and long read libraries, before merging the results using Minimus222. This method produced contigs for more than 650 Mbp of the human gut metagenome (N50 length of 49 kbp; Table 1); these were longer and more complete than ones assembled from either long reads (600 Mbp of sequence; N50 length of 38 kbp; Celera assembler) or short reads (232 Mbp of sequence; N50 length of 8.6 kbp; Soapdenovo2 assembler) alone. Twenty-two of the contigs we obtained were longer than 1 Mbp (Supplementary Figure 5), indicating that multiple organisms could be assembled completely or almost completely. The longest contig we recovered was 3.9 Mbp; its length and number of predicted ORFs23 were comparable to that of a closely related complete bacterial genome (see below). For comparison, long reads by themselves produced 19 contigs longer than 1 Mbp, whereas the longest contig from short reads alone was 410 Kbp. This indicates that synthetic long reads assemble a small number of complete chromosomes in addition to dozens of contigs that are almost an order of magnitude longer than ones obtained from short reads.
We also assessed our assembly strategy using the mock metagenome. Our merged assembly of long- and short-reads recovered 42 Mbp of sequence (of 83 Mbp total) into contigs with an N50 length of 92 kbp (Table 1, Supplementary Table 8). This assembly was quite accurate, with approximately one misassembly per 400 kbp on average (Supplementary Table 9), a rate that compares favorably to accuracies reported by previous empirical studies of de-novo assembly algorithms24. Using only long reads resulted in much shorter contigs (33 Mbp of sequence; N50 length of 43 kbp), suggesting that for low-complexity metagenomes, combining short- and long-read technologies might substantially improve assembly quality. We assembled on this dataset one contig longer than 1Mbp and three contigs longer than 500 Kbp, all of these were assembled using long reads alone, suggesting that short reads mainly improved assembly completeness rather than contiguity.
One of the potential advantages of long-read sequencing is the recovery of complete bacterial operons (clusters of functionally related genes that are transcribed together). By using the known positions of operons in the reference genomes of the species in the mock metagenome25, we confirm that operon recovery is feasible with a combined assembly of short- and long- reads (Supplementary Table 10). Our combined assembly recovered 4,500 operons, which represents more than half of all the known operons in the mock metagenome and twice the number that can be obtained using short reads alone. Interestingly, long and short reads by themselves reconstructed only about 2,500 operons each, and many could be assembled from only one dataset (Supplementary Figures 6, 7). We attribute this discrepancy to differences in coverage between short and long reads.
In particular, long reads enabled us to recover long flagellar operons present in E. coli (Supplementary Table 11); these operons are clinically relevant as flagella contribute to pathogenicity. We assembled complete sequences of 11 flagellar operons from three bacterial species (E. coli, R. sphaeroides, P. aeruginosa; Supplementary Table 12), which comprised half of the known flagellar operons in the mock metagenome. We also recovered multiple flagellar operons from the gut metagenome (Supplementary Table 13). For example, a 2.3 Mbp contig belonging to the genus Acinetobacter was found to contain 10 flagellar operons, the longest of which contained 11 genes.
We assessed variation among the bacterial strains whose genomes we assembled using Lens, a new tool that we created. (Table 2). In brief, we mapped long reads to assembled contigs using the BWA aligner; at many positions, read and contig sequences differed, which we interpreted as variation among recently diverged strains of the same species. We used Lens to determine and phase single-nucleotide variants (SNVs) and short indels based on this alignment; Lens performed these tasks via new algorithms that do not make any assumptions on either the read length or the ploidy of the organism (see Supplementary Methods; Supplementary Code; https://github.com/kuleshov/lens).
Lens found extensive intraspecies variation in almost every bacterial species in the human gut metagenome (Figure 2 (a); Supplementary Figures 8, 9), which in total contained more than 200,000 variants (Supplementary Methods). Lens assembled these variants into 5,024 haplotypes distributed across about 2,204 genomic regions with an N50 length of 19 kbp (Supplementary Figure 10); each region contained on average 3.93 bacterial haplotypes. More than 95% of regions overlapped with ORFs, and the longest region we found spanned 112 Kbp or 242 variants and contained four distinct haplotypes.
We observed that significantly fewer variants were present in essential genes26 compared to non-essential genes, as expected from evolutionary pressure (Supplementary Table 14; p < 0.02). We repeated the same analysis on the mock metagenome and also uncovered a small number of genomic variants (few were expected, as the sample is synthetic; see Supplementary Table 15); as in the gut metagenome, significantly fewer variants were found in essential genes26 (Supplementary Table 14; p < 1e-3). We used the variant annotation package SNPEff27 to predict the deleteriousness of each mutation in the genome of E. coli (Supplementary Table 16), and found that most variants had low to moderate effects. Only six variants had high effect and were all found in non-essential genes rhsB, ydfK, icd, and perR. These observations suggest that the variants Lens uncovers are not attributable to noise.
To evaluate the correctness of the phased haplotypes, we determined whether they satisfy perfect phylogeny28. A tree over haplotypes satisfies perfect phylogeny if all strains evolved from a common ancestor, and during this process, each position mutated at most once. Although this criterion is not applicable to distantly diverged species, it is useful when organisms undergo short evolutionary distances, as in the case of bacterial subspecies. In the human gut metagenome, most (85%) of the genomic regions harboring at least four haplotypes satisfied perfect phylogeny (for three haplotypes or less, perfect phylogeny always holds). When the model is not met (such as when certain positions have mutated twice), it is possible to measure the extent to which it is violated by estimating the number of positions that can be excluded to make perfect phylogeny hold. We were able to place more than 92% of all gut metagenome regions in perfect phylogeny by excluding at most 5% of positions within each region (Supplementary Figures 10, 11). These observations support the hypothesis that the variants we find correspond to distinct bacterial strains that have evolved from one another. Our approach is the first, to our knowledge, to uncover substrain resolution and offers a snapshot of how strains evolve in vivo.
Nanoscope uses the FCP software package29 to assign taxonomic labels to assembled contigs. FCP determines labels using either a homology-based approach (based on the lowest common ancestor or LCA algorithm), or a composition-based30 approach (a Naïve Bayes or NB classifier trained on k-mer frequencies). In principle, longer contigs should be easier to label because they contain more species-specific sequences and they should map with less ambiguity to known reference genomes.
In the human gut metagenome, 61.4% of contigs assembled from long reads could be labeled using the LCA method, compared with 46.5% of contigs derived from short reads (Supplementary Table 17). Similarly, 89.8% of contigs assembled using long reads could be labeled by the NB method, compared to 11.0% of contigs assembled using short reads. To assess the accuracy of these assignments, we used the mock metagenome. LCA assigned contigs with 100% accuracy on both long and short reads, whereas NB had accuracies of 99% and 98% on contigs obtained from long and short reads respectively (Supplementary Tables 18–21).
By examining the taxonomic labels assigned to the longest contigs in the gut metagenome, we were able to identify which bacteria could be assembled completely or almost completely (Supplementary Table 22). Five of the ten longest contigs belonged to the genus Bacteroides, and the longest 3.9 Mbp contig was from the genus Odoribacter. Many contigs – including one measuring 2.3 Mbp – could not be assigned accurate labels; these contigs correspond to either unknown species, or to species whose genomes have not yet been fully assembled (Supplementary Tables 23, 24). Notably, the above 2.3 Mbp contig did not match any known reference genome by more than 3.2 Kbp; however, we found that contigs from a fragmented draft assembly of a species from the genus Acinetobacter mapped to the 2.3 Mbp contig completely (Supplementary Figure 12). This suggests that we were able to recover the genome of that bacterium at a higher level of quality than a previous study that used metagenomics samples from 396 different individuals8. We observed similar mapping results for other unclassified contigs as well.
Finally, we determined the abundance of each bacterium by mapping short reads to gut metagenome contigs and then using the above taxonomic labels to propagate the resulting coverage estimates to each identified bacterium (Supplementary Methods). We found 178 species in the human gut metagenome and these species greatly varied in their abundance: some comprised as much as 5% of the metagenome, and others as little as 0.02% (Figure 3; Supplementary Table 25). Interestingly, different species were recovered by short and by long reads: short reads helped finding two relatively high-abundance bacteria, whereas long reads uncovered 51 species (mostly of low abundance) that were missed by short reads (given the same amount of sequencing). Moreover, by combining both short and long reads, 58 additional low-abundance bacteria could be identified. Finally, we found that on the mock metagenome, abundance estimates were highly concordant with those obtained from mapping short reads directly to the 20 known reference genomes (r^2 = 0.97; Supplementary Figure 13; Supplementary Table 26). This serves as an indicator of the accuracy of our approach.
Including synthetic long reads in metagenomic analyses significantly improves the delineation of complex metagenomic samples relative to short-read sequencing. Although synthetic long reads have sequencing biases that affect coverage in specific genomic regions, these biases can be overcome using a small amount of additional shotgun sequencing. The resulting approach offers three main advantages over existing methods. First, we recover long bacterial contigs (up to megabases in length) that span operons that could not be recovered by short read sequencing alone. Second, analysis of long reads produces kilobase-long haplotypes that can reveal evolutionary trends in microbial communities. Finally, longer sequencing read lengths enable identification of bacteria at abundances as low as 0.02% that are undetected by short reads.
Our approach is of course not without limitations. Synthetic long read technologies rely on a dilution step that attempts to reduce the number of copies of each repeat to at most one per well; genomes of bacteria at high abundances of 10% or more may not be sufficiently diluted, and several repeat copies may remain in a single well, preventing the subassembly of their long reads. This may explain why some gut metagenome bacteria at ~5% abundance were not identified by long reads alone. A related issue is the presence of short tandem repeats, which may also prevent subassembly. A third shortcoming of our approach is the reliance on a PCR step, which could introduce bias (see Supplementary Figures 3–4) and error. However validation using the mock metagenome and comparison with short shotgun reads indicates that this error does not exceed 2–3% (Supplementary Methods).
Despite these limitations, our approach can more readily shed light on the configuration of complete operons, and facilitate the identification of pathogenic strains (especially in a mixed population). For instance, we assembled multiple kilobase-long flagellar operons that affect motility and thus play a role during infection. Furthermore, the substrain resolution enabled by our methods could assist in understanding how evolution shaped strains over time in situ. Finally, the ability to identify low-abundance bacteria will help reveal the complete composition of environmental samples and discover new species.
These results are comparable in many ways to previous methods that required hundreds of human subjects8, multiple DNA extraction methods7 or tetranucleotide binning with a mix of Sanger and mate-pair sequencing6 (Table 3). Our method, on the other hand, requires only a single metagenomic sample, does not involve binning, and is superior at identifying bacterial strains. It is also feasible that long reads might work in tandem with previous approaches, since longer contigs will increase the accuracy of their statistical components. Similarly, longer contigs might improve the accuracy of scaffolding techniques such as ones based on Hi-C34. Finally, our approach is most closely related to single-molecule real-time sequencing (SMRT), an alternative long-read technology10. To compare our two methods, we assembled a publically available dataset for the mock metagenome (Supplementary Materials) using the MHAP assembly strategy33. This produced long contigs with fewer misassemblies than ones we obtained from synthetic long reads; however, the SMRT contigs also had a 5x higher indel rate and only 88% of SNVs called using Lens in these contigs could be confirmed with short reads (compared to 99% for synthetic long reads). It thus appears that SMRT reads produce excellent draft metagenomic assemblies, but their high error rate makes it difficult to identify and phase variants in the metagenome.
Finally, synthetic long reads were recently used by Sharon et al. to analyze a soil metagenome sample14. By focusing on a set of marker genes, they showed that the community comprised a combination of closely related strains and rare species. Our work demonstrates that long reads can produce much longer contigs than ones described by Sharon et al. (Table 3); in addition, owing to the lack of need for a marker gene set, our analysis pipeline can find additional species, including a phylum that marker genes did not reveal on the Sharon et al. environmental dataset (see Supplementary Material for full discussion).
In conclusion we reveal that the human gut microbiome is more complex than previously thought, particularly in terms of subspecies diversity. The rapid evolution of bacterial strains at the subspecies level could affect human physiology16. Armed with more complete inventories of microbiomes, it might be possible to strengthen associations between human and bacterial phenotypes.
Mock microbioal DNA, HM-277D Staggered v5.2H, was obtained from BEIresources. Gut microbiome DNA was isolated from the frozen feces of a healthy subject using PowerSoil DNA Isolation Kit (MO BIO Laboratories, Inc.). Both DNA samples were sequenced using Illumina Tru-seq synthetic long reads technique (three and seven libraries for mock and gut microbiome samples, respectively) and the standard shortgun technique, with each library sequenced on one full lane of HiSeq. All libraries were prepared according the manufacturer’s standard protocol. Shotgun sequence reads for both the mock and the gut metagenomic samples were subsampled at random to produce subsampled libraries containing the same amount of base pairs as the in the Tru-seq synthetic long read libraries. The results were assembled on the Illumina Basespace platform, according to standard protocol.
We used Ovation Ultralow DR Multiplex Systems 1–8 (0330-32, NuGEN Technologies, Inc.) for whole genome library preparation. Briefly, 100 ng of intact gDNA was diluted into 120 μL of 1X low EDTA-TE buffer and transferred to Covaris snap cap microtube and Fragmented to 300 bp following Covaris recommended settings. Fragmented DNA was purified using Agencourt RNAClean XP bead, provided by Nugen Library preparation kit. The sheared DNA was then subjected to end repair and adaptor ligation. Adaptor ligated libraries were purified with Agencourt RNAClean XP bead and amplified using 18 PCR cycle of 94°C for 30 sec, 60°C –for 30 sec, and 72°C for 1 min. Agencourt RNAClean XP bead was used for amplified Library Purification and libraries Fragment distribution was validated on Bioanalyzer DNA Chip 1000.
In order to facilitate the analysis of synthetic long read data for in the context of metagenomics, we have developed a bioinformatics pipeline called Nanoscope (Figure 1). Nanoscope takes as input a set of long read libraries together with optional (but highly recommended) short read libraries. It then performs a four-stage analysis of this data that includes de-novo assembly, variant calling and haplotyping, taxonomic identification, and abundance estimation.
Nanoscope starts by invoking the Soapdenovo20 and Celera21 assemblers to independently assemble the short and long read libraries, before merging the results using Minimus222. In the next step, it invokes a variant calling and phasing algorithm called Lens to analyze the assembled contigs for strain variation. Lens reveals hundreds to thousands of sites where individual bacteria of the same strain differ from each other and then phases these variants into bacterial haplotypes. A typical contig might harbor more than a dozen different strain haplotypes, each of which may contain thousands of sequence variants. Variants and haplotypes are determined using a simple model (see the section on Lens below) that, unlike previous approaches35,36, does not make any assumptions on the length of sequencing reads or the ploidy of the organism; we have found that these factors may confuse existing bacterial variant callers and lead to suboptimal results.
Finally, Nanoscope invokes the FCP software package29 to assign taxonomic labels to assembled contigs and to estimate bacterial abundances. The latter task is done by mapping short reads to assembled contigs and by aggregating the coverage over all contigs assigned to the same species. Computing abundances from short reads avoids certain biases inherent to synthetic long reads; mapping reads to contigs enables estimation of abundances for bacteria whose genomes are not present in standard databases. At each stage, Nanoscope uses the popular Quast tool41 to assess its performance and to generate reports.
Nanoscope differs from existing metagenomic pipelines42,43 because it includes additional programs for dealing with synthetic long reads (most notably, the Celera and Minimus2 assemblers). We modified the source code of some of these packages to handle longer genomic sequences (see Supplementary Material); all programs used by Nanoscope have also been tuned for longer read lengths. The source code of Nanoscope is publically available in an open-source repository.
Lens is a new variant calling and phasing tool specialized for metagenomes and synthetic long reads. It is based on algorithms that, unlike previous approaches35,36, do not make any assumptions on the length of sequencing reads or the ploidy of the organism; we have found that these factors may confuse existing bacterial variant callers and lead to suboptimal results (see Supplementary Material). At a high level, Lens does two things: starting from an alignment of long reads to assembled contigs (or to bacterial reference genomes), it first determines positions at which the reads and the reference differ; these positions are indicative of multiple closely related strains of the same bacterium. Then, Lens phases these variants into long haplotypes, each haplotype being defined in this context as a set of variants that co-occur within the same bacterial substrain.
The Lens haplotyper leverages the fact that each long read originates from a single organism, and therefore all variants within a read must belong to the same substrain. By connecting reads at their overlapping variants, Lens places the variants into multi-kilobase-long haplotypes in a process that is reminiscent of single-individual haplotyping (SIH) techniques37. In our setting, the number of true haplotypes is an unknown parameter that may be greater than two, making the phasing problem considerably more difficult. Although there exist well-known phasing algorithms for polyploid genomes (e.g. plants or cancer genomes), they all assume a fixed, known ploidy38,39, with the notable exception of some recent methods developed while this paper was under review 44,45; Lens on the other hand infers the ploidy directly from the data. More precisely, Lens assembles haplotypes using an approximate greedy procedure (see Supplementary Material); this choice is in part due to the fact that the SIH problem (of which ours is a generalization) is computationally intractable40. In brief, Lens sorts aligned reads from left to right and in turn uses each read to either extend an existing haplotype or to form a new one, depending on the read-haplotype overlap and on the cost of forming a new cluster (both are tunable parameters for the algorithm). Our high-level approach may in principle have applications outside metagenomics, such as in cancer genome phasing.
This work was supported by NIH/NHGRI grant T32 HG000044. V.K. was supported by an NSERC post-graduate fellowship. We thank Illumina, Inc. for their assistance in sample preparation.
ContributionsS.B. and M.S. conceived the study. W.Z. and F.J. performed library preparation. V.K. developed the Nanoscope pipeline and the Lens algorithm. V.K. and C.J. performed computational analyses. V.K., C.J., S.B. and M.S. wrote the paper. S.B. and M.S. supervised the study.
V.K. serves as a consultant for Illumina Inc. SB is a co-founder of DNAnexus and a member of the scientific advisory boards of 23andMe and Eve Biomedical. MS is a co-founder of Personalis and a member of the scientific advisory boards of Personalis, AxioMx and Genapsys.
1Note that the contigs of Nielsen et al. are also clustered into unordered sets belonging to the same species.