To examine how the size of PCR amplicons affects estimates of microbial diversity and taxonomic assignments in microbial ecology studies, we constructed three clone libraries from each of two hydrothermal vent fluid samples, with amplicons of approximately 100bp, 400bp, and 1000bp, all containing the V6 hypervariable region of the SSU rRNA gene. This allowed for direct comparisons between the different size libraries by examining only the V6 region using OTU- and taxonomy-based tools. All results support the conclusion that the 100bp amplicon libraries contained more distinct sequence types than the other two libraries and that more of those sequences differed from known sequences in the reference database. In addition, both the taxonomic assessments and the community similarity index showed that, for each sample, the 100bp library differed in community structure from the other two libraries, and that those other two libraries were very similar to each other.
There are many possible reasons for the differences seen between the three clone libraries. One obvious difference between the three libraries is that different primer combinations were used to generate each size class. Three primer combinations were used that included the V6 region, and each library from each sample shared one primer in common with another library from the same sample. PCR and cloning conditions were nearly identical across all reactions (amount of template, concentration of primers, dNTPs, Pfu, etc.) with the exception of the annealing temperature and extension time. For the 1000bp primer set, the annealing temperature was 55 °C and the extension time 2 min. For the 100 and 400bp sets, the annealing temperature was 57 °C and the extension time 1 min. For all samples, three individual reactions were pooled for cloning. One possible explanation for the similarities in community structure between the 400 and 1000bp libraries is that the same reverse primer was used (1391R). However, the same forward primer was used for the 100 and 400bp libraries (967F), and those libraries are different. More importantly, the 1391R primer appeared to anneal less faithfully in the 1000bp reaction, as suggested by the high number of mis-primed sequences detected in these libraries. Some of the mis-primed and chimeric sequences identified were exact matches to high quality V6 regions, indicating that these artifacts were generated from valid ribosomal RNA sequences. Others were non-ribosomal genes that apparently were amplified by the 1391R primer at both the 5’ and 3’ ends. Even when we included all of the artifactual sequences in our analyses, it was clear that the 100bp dataset contained more diversity than the 1000bp dataset.
All primers used are located in regions of secondary structure, which may affect primer annealing. Polz and Cavanaugh (1998) found overamplification of specific templates and determined that the higher the GC content of the priming region, the higher the resulting amplification efficiency. We examined the GC content of the priming region for each primer in E. coli and found that the 967F primer had the lowest GC content (53%), compared to 62–67% for the other three primers, suggesting GC content of priming regions is not a major contributing factor in our study. Polz and Cavanaugh (1998) also suggested that degeneracy in primers should be avoided, as primer degeneracy can reduce specificity and result in particular primer variants being depleted as the reaction progresses. However, acknowledging that no single primer fits all templates, they recommended pooling replicates to decrease variation between PCR reactions. While degenerate primers were used in our experiments, with degeneracy ranging from none to 64-fold, we also pooled replicates to decrease variation. In addition, even though the 1000bp primer set had a combined 512-fold degeneracy, similar results were found with the 400bp primer set, which had a combined degeneracy of only 8-fold. Primer specificity with respect to taxonomy also does not appear to explain our findings, as the least “universal” primer set resulted in the highest diversity estimates. We therefore do not believe that primer specificity explains our results.
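The two primer properties discussed above, fold degeneracy and GC content of the priming site, are simple to compute. The following is a minimal sketch, assuming that primers are written with standard IUPAC ambiguity codes and that the combined degeneracy of a primer pair is the product of the two individual values (for instance, a 64-fold primer paired with an 8-fold primer would give 512-fold). The sequences are hypothetical and are not the primers used in this study; the function names are illustrative only.

```python
# IUPAC nucleotide codes mapped to the bases each one can represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def degeneracy(primer: str) -> int:
    """Number of distinct oligonucleotides encoded by a degenerate primer."""
    fold = 1
    for base in primer.upper():
        fold *= len(IUPAC[base])
    return fold


def gc_fraction(site: str) -> float:
    """Fraction of G or C bases in a non-degenerate priming site."""
    site = site.upper()
    return sum(base in "GC" for base in site) / len(site)


# Hypothetical sequences for illustration only (NOT the primers from the study).
forward = "ACGYGCRAGTNCCGT"   # Y (2) x R (2) x N (4) = 16-fold
reverse = "GGTTACSTTGWTACG"   # S (2) x W (2) = 4-fold

# Assumption: combined degeneracy of a pair is the product of the two primers.
combined = degeneracy(forward) * degeneracy(reverse)
print(degeneracy(forward), degeneracy(reverse), combined)

# GC fraction of a hypothetical, fully specified priming site.
print(round(gc_fraction("ACGTGCAAGTCCCGT"), 2))
```

A priming site from a reference organism would be passed to `gc_fraction` as a non-degenerate string, since the GC content is a property of the template region rather than of the primer mixture.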
Another possible explanation for the differences seen between the libraries is cloning bias. There is little published data regarding cloning bias with 16S rRNA genes, but it is plausible that the longer fragments with more secondary structure may interfere with E. coli ribosome assembly or growth. Rainey et al. (1994) found that different taxa were obtained in clone libraries made with the same primer set but different cloning systems. In addition, as previously noted, it is unlikely that mixed communities of amplicons will clone with uniform efficiency, and it is most likely the low abundance genes that will account for this variation (Wintzingerode et al., 1997). Cloning bias remains a possible explanation for our results, particularly with respect to the low abundance members of the community.
An additional source of error in our experiment is related to the kinetics of PCR. It has previously been noted that PCR kinetics favor smaller amplicons (Kleter et al., 1998). Suzuki and Giovannoni (1996) tested two different primer pairs targeting two different sized amplicons, using three cloned ribosomal genes as standards. When the smaller amplicon primer set was used, regardless of starting template concentrations, a bias towards a 1:1 product ratio was observed and was dependent on the number of PCR cycles. They attributed this difference to kinetic bias, where the smaller primer set amplified at higher efficiency, resulting in the reaction reaching saturation conditions (Suzuki and Giovannoni, 1996). Saturated templates can then reanneal and inhibit further amplification, while undersaturated targets will continue to amplify, resulting in the skewed product ratio. The larger amplicon primer set amplified at lower efficiency, but showed only minimal bias in amplification product ratios. However, they note that in highly diverse environmental DNA samples, it is unlikely that any particular gene will reach saturation, and thus the reannealing kinetic bias effect is unlikely. As a follow-up to this work, Suzuki et al. (1998) further examined this kinetic bias in natural populations and found that the template reannealing bias could result in the over-representation of rare members of the microbial community and an under-representation of dominant members. However, others have not observed the same results. Sipos et al. (2007) did not find that reannealing was important in diverse template environmental samples, but instead found that the annealing temperature was key to reducing preferential amplification. This is similar to the findings of Lueders and Friedrich (2003) and Acinas et al. (2005), neither of whom found bias caused by cycle number or the reannealing effect. While some of our data may suggest a kinetic bias, we believe the skew in distribution of the library is due to undersampling of the smallest library, not kinetic bias.
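The reannealing argument can be illustrated with a toy calculation. The sketch below is not the kinetic model of Suzuki and Giovannoni (1996); it simply assumes that each template's per-cycle efficiency falls as its own product accumulates, using a hypothetical half-saturation constant (`half_sat`) and hypothetical starting amounts. Under that assumption, the product ratio of two templates drifts from its starting value towards 1:1 as the cycle number increases.

```python
def simulate_pcr(start, cycles, max_eff=1.0, half_sat=1e6):
    """Toy PCR model with self-inhibition.

    Each template's per-cycle efficiency is max_eff * half_sat / (half_sat + c),
    so it is close to max_eff while the product amount c is small and declines
    as that template's own product accumulates (a stand-in for reannealing).
    All parameter values are hypothetical.
    """
    amounts = list(start)
    for _ in range(cycles):
        amounts = [
            c * (1.0 + max_eff * half_sat / (half_sat + c))
            for c in amounts
        ]
    return amounts


# Hypothetical starting amounts: a dominant and a rare template at 100:1.
start = [1000.0, 10.0]

for cycles in (0, 10, 20, 30, 40):
    dominant, rare = simulate_pcr(start, cycles)
    # Print the cycle number and the dominant:rare product ratio.
    print(cycles, round(dominant / rare, 2))
```

In this sketch the ratio only shifts appreciably once the dominant template approaches `half_sat`, which mirrors the point that the effect is expected to be weak when no single template nears saturation, as in highly diverse environmental samples.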
The formation of PCR artifacts, such as heteroduplexes and chimeras, is another known problem in mixed community amplifications (Qiu et al., 2001). Many recommendations for how to minimize these artifacts have been published. For example, Qiu et al. (2001) suggested using fewer PCR cycles, longer extension times, AmpliTaq (over other types of Taq polymerases), and pooling reactions. They also noted that the artifacts increase as the diversity of the mixed community increases. Thompson et al. (2002) demonstrated that heteroduplexes increased with primer limitation, the number of different sequence variants in the original PCR, and the number of variable nucleotides in the target, and they recommended a ‘reconditioning’ step to reduce the possibility of heteroduplex formation (Thompson et al., 2002). No reconditioning to eliminate heteroduplexes was carried out on any of the samples, and all samples were treated identically (with the necessary exceptions of annealing temperature and extension time). One might predict more PCR artifacts in the largest library due to the 512-fold degeneracy of the 337F/1391R primer combination, the large number of nucleotide variants, and the greater chance of the polymerase falling off due to encountering secondary structure. Indeed, we found more artifacts in the largest libraries, as indicated by the high number of sequences flanked by primer 1391R at both the 5’ and 3’ ends of the amplicon. As noted, some of these sequences did contain valid ribosomal RNA sequences. In contrast, the V6 region contained in the smallest amplicon is not a very likely site of recombination due to its high variability. However, because we were unable to screen for artifacts in the 100bp library, we also ran all analyses with the artifact sequences included and found that the 100bp library still contained more diverse, unique sequences than the other two libraries.
Finally, the polymerase is a potential source of error in our experiments. All of the amplifications were carried out with the high fidelity, proof-reading PfuTurbo polymerase. It has previously been noted that some polymerases have lower efficiencies when amplifying large fragments (>900bp) or regions of high GC content. However, PfuTurbo does not appear to be as sensitive to amplicon size as other polymerases (Arezi et al., 2003). The inability of polymerases to amplify long fragments as efficiently as short fragments has been noted previously (Suzuki and Giovannoni, 1996; Wintzingerode et al., 1997; Kleter et al., 1998; Becker et al., 2000). This is especially important for the SSU rRNA gene, where encountering problematic secondary structure is quite likely, potentially causing the polymerase to dissociate from the template (Chou, 1992; Suzuki and Giovannoni, 1996; Wintzingerode et al., 1997; Polz and Cavanaugh, 1998; Qiu et al., 2001). We believe this may be an important source for the differences in the diversity estimates and community composition of the libraries. As the polymerase encounters secondary structure in the SSU rRNA gene, it dissociates, and the frequency of dissociation is thus correlated with amplicon length. This relationship is not necessarily linear, as we saw more similarity between the 400 and 1000bp libraries, suggesting that the secondary structure in the 1046–1391 region of the SSU rRNA may have caused problems for both primer sets. The extremely short length of the 100bp amplicon likely serves as an easier template for PCR to proceed.
The results of this study have important implications for molecular studies of microbial communities. While sequencing large portions of the SSU rRNA gene is essential for detailed phylogenetic analysis, long amplicons may not be the most appropriate tool for measuring total community diversity or taxonomic membership. Regardless of sequencing technology used, the primer set and amplicon size must be considered when designing appropriate molecular microbial ecology experiments. Obviously, if full phylogenetic reconstruction of environmental sequences is desired, larger amplicons are necessary. All three libraries captured the dominant bacterial groups, but the 400 and 1000bp libraries missed the more divergent and possibly low abundance groups, including members of the rare biosphere (Sogin et al., 2006). Therefore, if capturing the most abundant members of a microbial population is the goal, any size amplicon should suffice. However, if a more complete picture of the microbial community structure, membership, and diversity is desired, a smaller amplicon is likely better because it represents a broader sampling of the population, there is little or no systematic loss of specific groups, the PCR proceeds more efficiently, and the opportunity for artifact formation is less. At some point, there is a trade-off when the increased diversity detectable by the larger number of informative positions in the longer amplicon is overwhelmed by the number of distinct successful amplicons generated for the smaller length target. Smaller amplicons, however, do require additional sequencing effort because the library contains many more different types of sequences than larger libraries, therefore necessitating deeper sequencing to fully capture the diversity of the library and the microbial community structure. Less sequencing effort is required of larger libraries because there are fewer different sequences present and more modest sequencing efforts should capture the dominant players. All of these parameters need to be taken into consideration when carrying out PCR-based molecular surveys of microbial communities.
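The relationship between library richness and required sequencing effort can be sketched numerically. The example below assumes that clones are drawn at random with replacement from a library, so that the expected number of distinct sequence types observed after n reads is the sum over types of 1 - (1 - p_i)^n, where p_i is the relative abundance of type i. The two abundance profiles are hypothetical and were chosen only to contrast a library with a long tail of rare types against one with a short tail; they are not data from this study.

```python
def expected_types_observed(abundances, n_reads):
    """Expected number of distinct sequence types seen in n_reads random draws
    (with replacement) from a library with the given abundances."""
    total = sum(abundances)
    return sum(1.0 - (1.0 - a / total) ** n_reads for a in abundances)


# Hypothetical libraries: five dominant types plus a tail of rare types.
short_amplicon_library = [100] * 5 + [1] * 500   # long tail of rare types
long_amplicon_library = [100] * 5 + [1] * 50     # short tail of rare types

for n in (100, 1000, 10000):
    frac_short = expected_types_observed(short_amplicon_library, n) / len(short_amplicon_library)
    frac_long = expected_types_observed(long_amplicon_library, n) / len(long_amplicon_library)
    # Fraction of each library's sequence types expected to be recovered.
    print(n, round(frac_short, 2), round(frac_long, 2))
```

Under these assumptions, the same number of reads recovers a smaller fraction of the richer library, while the dominant types are recovered in both at modest depth.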