In the field of microbial ecology, the polymerase chain reaction (PCR) has been widely used for the amplification, detection and quantification of DNA targets since its introduction [1], resulting in increased knowledge of the microbial world [3]. However, the efficiency and accuracy of PCR can be diminished by many factors, including primer-template mismatches, reactant concentrations, the number of PCR cycles, annealing temperature, and the complexity of the DNA template [5]. Primer-template mismatches are the most important because they can lead to selective amplification, which prevents the correct assessment of microbial diversity [8]. Target sequences that do not match the primers precisely will be amplified to a lesser extent, possibly even below the detection limit. The relative abundance of the recovered sequences is therefore altered, resulting in a deviation from the true community composition. Hence, a comprehensive evaluation of bacterial primer coverage is critical to the interpretation of PCR results in microbial ecology research.
Many studies of primer coverage have been performed, but most are qualitative or semi-quantitative and restricted to the domain level [10]. Low coverage rates in some rare phyla may therefore have been overlooked.
Although Wang et al. [12] investigated primer coverage rates at the phylum level, only sequences from the Ribosomal Database Project (RDP) were used. This sole reliance on the RDP is another common limitation of previous studies. The RDP is a specialized database containing more than one million 16S rRNA gene sequences. It also provides a series of data analysis services [13], including Probe Match, which is often used in primer studies. However, despite the RDP’s large collection of sequences and extensive application, most of its sequences were generated through PCR amplification. Sequences that fail to match the universal primers may be lost during PCR and so are not included in the RDP. Consequently, primer coverage rates calculated from the RDP appear to be higher than they actually are.
Fortunately, with the rapid development of sequencing techniques, many large-scale metagenomic datasets have become available. Metagenomic sequences are generated by directly sequencing environmental samples and are free of PCR bias; thus, the resulting datasets reflect microbial composition more faithfully, especially in the case of the rare biosphere. The Community Cyberinfrastructure for Advanced Microbial Ecology Research and Analysis (CAMERA) is not only a repository for rich and distinctive metagenomic data but also provides a set of bioinformatic tools for research [15].
Another shortcoming of previous primer-coverage studies has recently been illuminated through studies on the PCR mechanism. In the past, it was assumed that a single primer-template mismatch would not obstruct amplification at an appropriate annealing temperature, so long as the mismatch did not occur at the 3′ end of the primer. However, recent studies have shown that a single mismatch within the last 3–4 nucleotides of the 3′ end can also significantly reduce PCR amplification efficiency, even at the optimal annealing temperature [16]. This finding changes the criteria for judging whether a primer binding-site sequence can be amplified faithfully by PCR. In this study, we define sequences that “match with” the primers as having either no mismatch with the primer, or only one mismatch that is not located within the last 4 nucleotides of the 3′ end.
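The matching criterion defined above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used in this study: the function names, the `protected_3prime` parameter, the example primer, and the treatment of degenerate primer bases through IUPAC codes are all assumptions. The binding site is assumed to be supplied already aligned to the primer, with the same length and in the primer's 5′→3′ orientation.

```python
# IUPAC nucleotide codes mapped to the bases each code stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def mismatch_positions(primer, site):
    """Return 0-based positions (5'->3' along the primer) where the site
    base is not one of the bases allowed by the primer at that position."""
    if len(primer) != len(site):
        raise ValueError("primer and binding site must have equal length")
    return [
        i
        for i, (p, s) in enumerate(zip(primer.upper(), site.upper()))
        if s not in IUPAC.get(p, "")
    ]


def matches_primer(primer, site, protected_3prime=4):
    """Apply the criterion from the text: no mismatch, or exactly one
    mismatch lying outside the last `protected_3prime` nucleotides."""
    mismatches = mismatch_positions(primer, site)
    if not mismatches:
        return True
    if len(mismatches) > 1:
        return False
    # A single mismatch is tolerated only upstream of the 3' end window.
    return mismatches[0] < len(primer) - protected_3prime


if __name__ == "__main__":
    primer = "ACGTACGTAC"  # hypothetical 10-nt primer, for illustration only
    print(matches_primer(primer, "ACGTACGTAC"))  # perfect match -> True
    print(matches_primer(primer, "ACTTACGTAC"))  # one 5'-side mismatch -> True
    print(matches_primer(primer, "ACGTACGTTC"))  # mismatch near 3' end -> False
```

In this sketch, insertions and deletions are not handled, and an ambiguous base in the target sequence (for example `N`) is counted as a mismatch.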
All of the primers in this study are frequently used in molecular microbial ecology research. The most common primer pairs combine 27F with 1390R or 1492R and are mainly used for constructing clone libraries of the full-length 16S rDNA sequence [18]. Primers such as 338F and 338R are frequently used in pyrosequencing [19]. The remaining primers are most commonly used for fingerprint analyses, but the development of next-generation sequencing techniques will likely broaden their roles in future studies [22]. Pyrosequencing has extended the read length from 100 bp to 800 bp [24], and as a result, it will become possible to sequence hypervariable regions of the 16S rDNA other than V6 and V3. Primers that cover these hypervariable regions will become more frequently used.
The aim of this study was to assess the coverage rates of 8 common primers (27F, 338F, 338R, 519F, 519R, 907R, 1390R and 1492R), which target different regions of the bacterial 16S rRNA gene, using sequences from the RDP and 7 metagenomic datasets. We used the non-coverage rate, defined as the percentage of sequences that did not match with the primer, as the major indicator in this study. Non-coverage rates were calculated at both the domain and phylum levels, and the influence of a single mismatched position on the non-coverage rate was analyzed. By comparing the RDP and the metagenomic datasets, we found that the non-coverage rates were substantially underestimated when only the RDP dataset was used.
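As an illustration of the indicator, the non-coverage rate for a primer can be computed as the percentage of sequences in a group that do not match with it. The sketch below is a hypothetical helper, not the code used in this study: the function name, the `(phylum, matched)` record layout, and the toy input are assumptions, and the match decision for each sequence is taken as given.

```python
from collections import defaultdict


def non_coverage_rates(records):
    """Compute non-coverage rates in percent.

    `records` is an iterable of (phylum, matched) pairs, where `matched`
    is True if the sequence matches with the primer. Returns a tuple of
    (rate over all sequences, {phylum: rate within that phylum}).
    """
    total = defaultdict(int)
    missed = defaultdict(int)
    for phylum, matched in records:
        total[phylum] += 1
        if not matched:
            missed[phylum] += 1

    n_all = sum(total.values())
    if n_all == 0:
        raise ValueError("no sequences supplied")

    overall = 100.0 * sum(missed.values()) / n_all
    per_phylum = {p: 100.0 * missed[p] / total[p] for p in total}
    return overall, per_phylum


if __name__ == "__main__":
    # Toy input with made-up phylum labels, for illustration only.
    toy = [("A", True), ("A", False), ("A", True), ("A", True), ("B", False)]
    print(non_coverage_rates(toy))
```

Reporting the rate per phylum alongside the overall rate makes it visible when a small phylum is poorly covered even though the overall value is low.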