|Home | About | Journals | Submit | Contact Us | Français|
To date, metagenomic studies have relied on the utilization and analysis of reads obtained using 454 pyrosequencing to replace conventional Sanger sequencing. After extensively scanning the 16S ribosomal RNA (rRNA) gene, we identified the V5 hypervariable region as a short region providing reliable identification of bacterial sequences available in public databases such as the Human Oral Microbiome Database. We amplified samples from the oral cavity of three healthy individuals using primers covering an ~82-base segment of the V5 loop, and sequenced using the Illumina technology in a single orientation. We identified 135 genera or higher taxonomic ranks from the resulting 1,373,824 sequences. While the abundances of the most common phyla (Firmicutes, Proteobacteria, Actinobacteria, Fusobacteria and TM7) are largely comparable to previous studies, Bacteroidetes were less present. Potential sources for this difference include classification bias in this region of the 16S rRNA gene, human sample variation, sample preparation and primer bias. Using an Illumina sequencing approach, we achieved a much greater depth of coverage than previous oral microbiota studies, allowing us to identify several taxa not yet discovered in these types of samples, and to assess that at least 30,000 additional reads would be required to identify only one additional phylotype. The evolution of high-throughput sequencing technologies, and their subsequent improvements in read length enable the utilization of different platforms for studying communities of complex flora. Access to large amounts of data is already leading to a better representation of sample diversity at a reasonable cost.
Oral health, which is strongly influenced by oral microbiota, has a significant impact on general health. The bacterial community in the mouth contains species that promote health, and others that contribute to illness. Recent studies have shown that poor oral hygiene and/or the presence of particular species in the mouth may be associated with periodontitis, respiratory infection and intestinal disease (Avila et al., 2009; Kuehbacher et al., 2008; Raghavendran et al., 2007). In addition, the salivary microbiota may be used as a diagnostic marker for cancer (Mager et al., 2005) and periodontal disease (Faveri et al., 2008) as well as to provide insights into human population studies (Nasidze et al., 2009a). Understanding which species are present and how the community is composed in healthy adults is the first step towards understanding how changes can lead to disease.
Experts have recently raised the hypothesis that in some chronic diseases, the “pathogen” might be a disturbed microbial community rather than a single organism (Friedrich, 2008). Understanding the contribution of “behind-the-scenes” species which influence the pathogenicity of other species has already led to important changes in treatment strategies (Sibley et al., 2006). These unexpected interactions are changing how microbiologists think about causation of infection and disease (Lipkin, 2009).
Until recently, knowledge of the bacteria that reside in the human oral cavity was limited to those species that could be cultured in the laboratory. New sequencing technologies have brought tremendous improvements in automated sequencing and analysis of genome features. Today around 900 complete prokaryotic genomes are publicly available (www.ncbi.nlm.nih.gov/genomes/lproks.cgi) as well as more than a million 16S rRNA gene sequences, and several hundred metagenomic datasets. The Human Oral Microbiome Project (www.homd.org) now contains close to 1000 species which have been found in the mouth, while a metagenomic-based estimate of the diversity is one order of magnitude higher (Keijser et al., 2008). The availability of these extensive and varied sequences has opened the way for comparative genomics techniques (Fraser et al., 2000) for evaluating relatedness and diversity as well as studying whole viral or bacterial content of various media (Venter et al., 2004; Williamson et al., 2008) or bacterial infections (Cox-Foster et al., 2007; Nakamura et al., 2008; Turnbaugh et al., 2009).
Here, we evaluate the potential of Illumina high-throughput sequencing with an unprecedented depth of sequence coverage for the study of human oral microbiota diversity. We use partial sequences from the well-characterized and conserved 16S rRNA gene, to enable classification of bacteria from human oral samples.
We collected saliva and oropharyngeal samples over a one-week period from three adult individuals with informed consent. Saliva samples were collected by expectoration into a sterile plastic 50-ml tube and kept frozen at −20 °C until processing. We mixed 100μL of each saliva sample with the same volume of 2× lysis buffer [Tris 20 mM, EDTA 2 mM (pH 8), Tween 1%, proteinase K (Fermentas) 400 μg/ml] and incubated them for 2h at 55 °C (Faveri et al., 2008). Proteinase K was inactivated by a 10 min incubation at 95 °C and the samples were frozen at −20 °C.
Dry cotton swabs (Copan) were used to gently swab the posterior wall of the oropharynx. They were directly suspended in a microtube containing 200 μL of lysis buffer and processed in the same way as the saliva samples. The saliva and oropharyngeal lysates from all three subjects were mixed in a 1:10 ratio with roughly equal contributions from the two sampling sites according to PCR yield.
We aligned 753 16S rDNA sequences from the Human Oral Microbiome Database (HOMD, October 2008) using MAFFT (-FFT-NS-2, v6.531b) (Katoh et al., 2002). We chose primers from the conserved areas of the alignment flanking the V5 region so as to match most sequences. With a 100%match, primers 784DEG(5′-GGMTTAGATACCC) and 880RDEG (5′-CRTACTHCHCAGGYG) sequences produced 740 and 745 hits, respectively, or 732 (97.2%) of the HOMD sequences. Species coverage was within the 91–100% range for all HOMD bacterial phyla except Chloroflexi which is very rare in oral microbiota (Keijser et al., 2008) and has a single representative in the HOMD.
PCR amplification was carried out in a 50 μL PrimeStar HS Premix (Takara) containing 5 μL of lysate and 0.5 μMof each forward (784DEG) and reverse (880RDEG) primer. The samples were run in two separate PCRs for 15 cycles using the following parameters: 98 °C for 10 s, 46 °C for 15s, and 72°C for 1min. The two PCRs were then pooled and phosphorylated with polynucleotide kinase and the Illumina paired-ends adapters were ligated with T4 DNA ligase. After PCR amplification with Phusion for 10 cycles using Illumina paired-ends PCR primers, the library was quality controlled by cloning an aliquot into a TOPO plasmid and capillary sequencing 16 clones. The library was sequenced from the forward end for 76 cycles on the Illumina Genome Analyzer system GAII using sequencing kits version 3.0. The 16S V5 amplicons correspond to E. coli positions 785 to 894 including primer sequence and to positions 798 to 879 excluding primers.
Base-calling was performed with the GAPipeline 1.3.2 using standard parameters, which include purity filtering with “chastity 0.6”. We removed sequences containing uncalled bases, incorrect primer sequence or runs of ≥12 identical nucleotides. Seventy-two-base sequence reads were trimmed to remove the 13-base forward primer sequence, yielding 59-base sequences.
We assigned taxonomy to sequences with GAST (Huse et al., 2008), using a database of reference V5 rDNA sequences (RefHVR_V5) from SILVA (version 98) (Pruesse et al., 2007), and taxonomy from known cultured isolates, Entrez Genome projects, the Ribosomal Database Project [RDP; (Cole et al., 2005)], Greengenes (DeSantis et al., 2006) and hand curation. GAST compares each tag to the RefHVR_V5 and aligns it to its nearest neighbors in the database and then selects the closest reference (s). The taxonomy for the tag is the lowest common ancestor for a two-thirds majority of all 16S rDNA sequences associated with the nearest V5 reference sequences.
Before generating clusters of phylotypes, we filtered out all sequences that occurred fewer than 3 times. This reduced the number of unique sequences to a computationally manageable level, and potentially reduced the number of errors from sequencing and contamination. We created a multiple sequences alignment of the remaining data using MUSCLE (Edgar, 2004) with parameters -maxiters 2 and -diags, and generated phylotype clusters and diversity estimates using MOTHUR (Schloss and Handelsman, 2005).
To examine which region of the 16S rRNA gene would be possible to target with the short Illumina sequencing reads, we extracted various sections of aligned 16S rDNA sequences available for 753 species in the Human Oral Microbiome Database and submitted them to the RDP classifier with a 80% confidence cutoff. The entire V5 120-base region as well as the 59-base segments from its forward end lead to many fewer unclassified sequences than their V6-region counterparts. (Table 1). Therefore, the paired-end data from the ~82-base V5 region we amplified in the current study would provide a means to capture taxonomic information suitable for studying the microbial diversity with the Illumina technology, similar to that of the favored V1–3 and V6 regions which are used when longer sequence reads are possible.
We explored the microbial diversity of the pooled saliva and oropharyngeal swab samples from three individuals by targeting the 16S rDNA hypervariable V5 regions. Of 1,373,824 obtained reads, 1,237,319 [publicly available at the MG-RAST repository (Meyer et al., 2008) under ID:4444448.3] passed the quality control. They were clustered in 377,275 distinct sequences most of which (330,815) were unique.
We analyzed the taxonomic composition and abundance of the oral bacterial community using GAST (Huse et al., 2008), the MG-RAST server (Meyer et al., 2008) and the RDP classifier (Wang et al., 2007). RDP’s Seqmatch program may also eventually be useful for determining which sequences in the RDP database are most closely related to our sequences, it works for sequences as short as 7 bases but only for 2000 sequences at a time.
The mean RDP confidence level for the six taxonomic levels from domain to genus was calculated as a function of the sequence abundance. The confidence decreases as the sequence copy number decreases (Fig. 1). This general trend is most likely due to the fact that the most frequent sequences correspond to known species whose 16S rDNA sequences are available. Conversely, the rare species include a higher proportion of 16S rDNA sequences absent or distant from those in the RDP reference database. In addition, the probability that a sequence contains an error is expected to be higher in low frequency sequences (Andersson et al., 2008).
To limit the impact of sequencing errors, we removed all sequences that occurred less than three times. This new dataset contains 865,540 reads representing 25,978 distinct sequences. We discarded 381 reads (<0.05%) with a GAST distance that diverged more than 30% from their nearest reference sequence, leaving 865,159 sequences. For the MGRAST analysis with a minimum alignment length of 50 and a maximum BLAST e-value of 10−10 21,713 sequences (2.5%) were removed, leaving 843,827 sequences. The phylogenetic assignments using the RDP classifier were performed after two additional filtering steps. They included the removal of sequences that were better classified when considered as reverse complements and those that had <80% confidence at the domain level. In this way the number of reads was reduced to 854,968 (24,757 distinct sequences).
The combined saliva and oropharyngeal swab samples were dominated by the phyla Firmicutes, Proteobacteria, Actinobacteria, Fusobacteria, TM7 and Spirochaetes (Table 2), that are also abundant in other oral samples assessed by means of phyloarrays (Huyghe et al., 2008) or massively parallel pyrosequencing of the 16S rDNA clones or amplicons (Keijser et al., 2008; Nasidze et al., 2009a). Their proportions, however, differ in different studies (Table 2). Removing the 80% confidence cut-off in the RDP classification results in phyla breakdown that are very similar between RDP and GAST (data not shown); however, bootstrap values of less than 80% cannot be trusted. The MG-RAST server and RDP classifer returned a higher fraction of unclassified bacteria, likely because they are not well designed for such short sequences. This may explain the lower content of major phyla relative to that generated by GAST which was designed specifically for short tag sequences. Proteobacteria is an exception since their abundances were similar with the three classification tools. Therefore, the RDP- and MG-RAST-based classification of V5 rDNA sequences appeared to be more sensitive for Proteobacteria than for other major phyla. Indeed, the RDP classification of the HOMD 16S rDNA sequences showed that the relative abundance of Proteobacteria, in contrast to those of other major phyla, was not reduced when using 59-base V5 sequences instead of their full length counterparts (Table 1).
The ability to identify taxa from class down to the genus level varied between phyla and was dependent on the classification approach (Fig. 2). For the six major phyla, GAST generated the highest proportion of reads placed at these levels of taxonomy. Fusobacteria and Spirochaetes had the largest proportion of reads that can be confidently placed at the genus level using all three classification approaches. This proportion was the lowest for Proteobacteria despite their robust classification at the phylum level (see above).
Some consider organisms with more than 1.3% sequence difference in 16S rDNA sequence (based on the full-length) to belong to different species (Stackebrandt and Ebers, 2006). Since a single nucleotide difference in a 59-base-long sequence corresponds to a 1.7% resolution, there may be more than 25,000 species-level phylotypes in our dataset (Fig. 3). For a more conservative estimate of species-level phylotypes, we used a cutoff of 3% corresponding to a 2-base resolution (Stackebrandt and Goebel, 1994) to create clusters of sequences. There are at least 8,000 different phylotypes at the 3% level. This will be an underestimate since we removed all sequences occurring less than three times prior to analysis. These filtered sequences would include valid but rare organisms as well as many low-quality sequences.
We used rarefaction analysis to determine the microbial diversity recovery in the filtered dataset. The rarefaction curve is very stable at ~8000 (Fig. 3), suggesting that the sampling completeness is high — at least 30,000 additional reads would be required to discover a new unique phylotype, and more than 120,000 additional reads would be required to discover a new 3% phylotype. The removal of unique sequences impacts the rarefaction curve, and may underestimate the potential for new species detection in human saliva samples. However, sampling is sufficient among the sequences likely to be prevalent in human saliva because they were found at least 3 times.
A total of 135 genera or next higher taxonomic ranks were identified by GAST (Appendix A). The most frequent genera were Neisseria and Streptococcus, constituting about 70% of the sequences. Thirty-four taxa have not been identified in previous studies of oral microbiotas (Keijser et al., 2008; Nasidze et al., 2009a,b) and are not listed in the Human Oral Microbiome Database. They include some low-abundance genera as well as putative members of the candidate divisions BRC1, OP10, OP3. The MG-RAST server also identified BRC1 and OP10 sequences.
The observed relative low abundance in Bacteroidetes in our data compared to previous studies may be accounted for by many factors including sampling from different anatomical sites, individual variation, sample size, as well as potential bias in lysis, amplification and classification. Indeed, it has been shown that some of the “frequent” species are absent in some individuals (Aas et al., 2005). Good oral hygiene is known to decrease the proportion of Gram-negative bacteria including some Bacteroidetes species. The amplification bias has been invoked to explain a decline of Bacteroidetes in a metagenomic study of a series of fecal samples (Andersson et al., 2008). This is unlikely to be the case in our study since the PCR primers used cover 104 of 107 (97%) Bacteroidetes species listed in the HOMD.
To the best of our knowledge this is the first metagenomic study based on the utilization of the Illumina high-throughput sequencing technology. Illumina sequencing provides more sequence reads per run, allowing for more in-depth coverage than the competing technologies. This enables analysis of larger sample sizes, inclusion of more bar-coded time-points and samples, and better assessment of total diversity in microbial communities. Metagenomic studies of other human microbial communities in the gut, stomach and skin have shown that it is not clear whether a core community of bacteria is common to most humans, making the less common species important to understanding human health and disease (Hamady and Knight, 2009).
The advantage of generating and sequencing short 16S rDNA amplicons for bacterial community analysis is that it reduces the likelihood of generating chimera and increases the likelihood of detecting low-abundance taxa (Huber et al., 2009). Moreover, reads of 100–200 bases obtained with carefully chosen amplification primers can yield the same clustering as long 16S rDNA sequences (Liu et al., 2007).
There is a concern that short read length may compromise the classification quality. However, for the current Illumina read length using the V5 region of the 16S rDNA, taxonomic assessment at the phylum level is sufficient to effectively compare samples. The taxonomic analysis based on the Illumina technology will be improved by paired-end reads (Table 1) which are expected not only to generate longer sequences but also to increase the sequence quality. Since the probability of a sequencing error increases with the read length (Qu et al., 2009), partially overlapping complementary reads of the same amplicon may help to predict sequencing errors and aid the removal of ambiguous reads or parts of reads.
This work was supported by grants from the Swiss National Science Foundation 3100A0-112370/1 (JS) and 3100A0-116075 (PF), and United States National Institutes of Health grant UH2DK083993 (SH).
Supplementary data associated with this article can be found, in the online version, at 10.1016/j.mimet.2009.09.012.