The gene encoding the RNA of the bacterial small ribosomal subunit, the 16S rRNA gene, is the most common housekeeping genetic marker used in bacterial phylogeny and taxonomy. The reasons are its presence in almost all bacteria, its relative stability over time and a size large enough for informatics purposes (1). Cloning the (nearly complete) 16S rRNA gene in Escherichia coli followed by sequencing, although elaborate and costly, became a standard method for determining microbial community composition (2). With the advent of high-throughput next-generation sequencing (NGS) technology, the cloning bias could be circumvented and the cost per nucleotide substantially reduced. The standard method of assessing the taxonomic composition of microbial communities is now to sequence the 16S rRNA gene using PCR amplification and NGS technology. The bacterial 16S rRNA gene consists of conserved sequences interspersed with variable sequences that include nine hypervariable regions (4). These regions are flanked by conserved parts of the 16S rRNA gene, which are used in primer design to target as diverse a bacterial community as possible. The sequences of the hypervariable regions themselves are used to discriminate among bacterial taxa.
Different hypervariable regions evolve at different rates, and different species of the same genus (or, e.g., genera of the same family) may be similar in some hypervariable regions and more divergent in others (5). Primer bias occurs when the selected primers do not anneal equally to the DNA of all members of the community but preferentially amplify certain taxonomic groups. For instance, Verrucomicrobia, a bacterial phylum previously thought to occur in soil at low abundance, was shown to be highly abundant in different soil samples simply by replacing the commonly used primer set 27F/338R (V1–V2), which is evidently biased against Verrucomicrobia, with the primer set 515F/806R targeting hypervariable region V4 (7). Assessing the nature and extent of primer bias is an important first step whenever primers are selected. In silico testing for the most effective regions for discerning taxa from a particular environment, or for finer resolution of particular taxa, would have a large impact on experimental costs and outcomes. This has recently been demonstrated within the Human Microbiome Project (8), where both the V1–V3 and the V3–V5 sections of the rRNA gene were sequenced, trimmed and clustered into operational taxonomic units (OTUs) at 3% sequence dissimilarity (9). The V1–V3 data showed three dominant Lactobacillus OTUs, which appear to differentiate L. crispatus, L. iners and L. gasseri. These OTUs correspond to the three primary vaginal biome types identified by Zhou et al. and Ravel et al. The V3–V5 sequence data, however, were dominated by only one OTU, which included more than six different Lactobacillus species. Conversely, the V3–V5 sequence data identified a Bifidobacteriaceae OTU that was not detected as such with the V1–V3 sequences.
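The difference in resolving power between regions can be illustrated with a minimal sketch. It assumes that the reference sequences have already been trimmed to one region, and it uses exact sequence identity as a stand-in for the 3% OTU threshold used in the study cited above, which is a deliberate simplification. The function name, taxon labels and sequences are hypothetical toy values, not real 16S data.

```python
from collections import defaultdict


def indistinguishable_groups(region_seqs):
    """Group taxa whose trimmed region sequences are identical.

    region_seqs maps a taxon label to its sequence for ONE region.
    Returns a sorted list of groups (each a sorted list of labels)
    that this region cannot tell apart. Exact identity is an
    assumption made for brevity; a real analysis would cluster at a
    dissimilarity threshold instead.
    """
    by_seq = defaultdict(list)
    for taxon, seq in region_seqs.items():
        by_seq[seq.upper()].append(taxon)
    return sorted(sorted(group) for group in by_seq.values() if len(group) > 1)


# Toy data (hypothetical): region_a separates all three taxa,
# region_b merges sp_A and sp_B.
region_a = {"sp_A": "ACGT", "sp_B": "ACGA", "sp_C": "ACGG"}
region_b = {"sp_A": "TTGA", "sp_B": "TTGA", "sp_C": "TTGC"}

print(indistinguishable_groups(region_a))  # []
print(indistinguishable_groups(region_b))  # [['sp_A', 'sp_B']]
```

Running the same check on each candidate region gives a quick, if coarse, indication of which region separates the taxa of interest.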
The data resulting from PCR amplification and NGS require processing through a bioinformatics pipeline. This pipeline should ensure that low-quality sequences are discarded and that meaningful groups or clusters of sequences, the OTUs, are created. The representative sequence of each OTU is then compared with sequences in publicly available 16S rRNA gene databases and, when possible, a consensus taxonomic lineage (genus, family or higher taxon) is assigned to the OTU. For these downstream analyses, only the amplified part of the 16S rRNA gene is required. Using short amplicon sequences instead of full-length rRNA genes as reference sets in computational pipelines reduces run times considerably. Some programs, such as GAST (13), which assigns taxonomy based on the best match in a Global Alignment for Sequence Taxonomy, require a trimmed database that matches the length of the amplicons. An additional advantage of a trimmed database is that it can serve as a quality check for accurate trimming of the sequenced amplicons.
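Trimming a reference sequence to the amplified region can be sketched as follows. This is an illustration under stated assumptions, not the method of any particular tool: primers are matched exactly (degenerate IUPAC bases are expanded into regular-expression character classes), the reverse primer is given 5'→3' and searched as its reverse complement, and the primer sites are excluded from the result unless `keep_primers` is set. The function names and the toy sequences are hypothetical.

```python
import re

# IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
# Complement table that also covers the degenerate codes.
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Reverse complement of a (possibly degenerate) DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def primer_regex(primer):
    """Compile a primer into a regex with one character class per base."""
    return re.compile("".join(f"[{IUPAC[base]}]" for base in primer.upper()))


def trim_to_amplicon(seq, fwd, rev, keep_primers=False):
    """Return the part of seq between the two primer sites, or None.

    None is returned when either primer has no exact match, i.e. the
    reference would not be amplified under this simplified model.
    """
    seq = seq.upper()
    f = primer_regex(fwd).search(seq)
    if f is None:
        return None
    # Look for the reverse primer site downstream of the forward primer.
    r = primer_regex(reverse_complement(rev.upper())).search(seq, f.end())
    if r is None:
        return None
    if keep_primers:
        return seq[f.start():r.end()]
    return seq[f.end():r.start()]


# Toy example (hypothetical primers and reference sequence).
reference = "TTGCAGCACGTACGTGATTCC"
print(trim_to_amplicon(reference, "GCMGC", "AAYC"))  # ACGTACGT
```

A real pipeline would additionally tolerate a small number of mismatches and handle RNA alphabets (U instead of T), both of which are omitted here for brevity.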
Programs already exist that test which sequences match a given oligonucleotide probe. For the different rRNA gene databases, these are SILVA’s TestProbe (14), Greengenes’ Probes (15) and RDP’s Probe Match (16). Probes can be designed using stand-alone software such as Primrose (17) and PrimerProspector (18). The latter provides a probe/primer design pipeline that supports de novo barcoded primer design and includes command-line scripts to analyze taxonomic coverage. Most programs, however, do not return trimmed reference sequences matching the probes.
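The core of such a probe match, testing whether an oligonucleotide anneals to a sequence, can be sketched in a few lines. The sketch below is not the algorithm of any of the tools named above; it is a simple sliding-window comparison that honours IUPAC degenerate bases and allows a configurable number of mismatches. The function names, the probe and the target sequences are hypothetical toy values.

```python
# IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def count_mismatches(probe, site):
    """Number of positions where the site base is not allowed by the probe."""
    return sum(base not in IUPAC[code] for code, base in zip(probe, site))


def find_probe(probe, seq, max_mismatches=0):
    """Return the first 0-based position where the probe matches, or -1.

    The probe may contain degenerate IUPAC codes; the target sequence
    is assumed to contain only A, C, G and T.
    """
    probe, seq = probe.upper(), seq.upper()
    width = len(probe)
    for start in range(len(seq) - width + 1):
        if count_mismatches(probe, seq[start:start + width]) <= max_mismatches:
            return start
    return -1


# Toy example: M stands for A or C, so the probe matches GCAGC exactly
# but needs one allowed mismatch to match GCTGC.
print(find_probe("GCMGC", "TTGCAGCTT"))                    # 2
print(find_probe("GCMGC", "TTGCTGCTT"))                    # -1
print(find_probe("GCMGC", "TTGCTGCTT", max_mismatches=1))  # 2
```

Treating all mismatches equally is a simplification; mismatches near the 3' end of a primer generally affect amplification more than those near the 5' end, which this sketch does not model.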
We have developed TaxMan, a straightforward web tool that trims the reference sequences of several rRNA gene databases to the hypervariable regions amplified by pre-selected primers and allows interactive analysis of taxonomic coverage. We show that using the provided trimmed sequences in computations increases analysis speed. Additionally, by analyzing the taxonomic coverage across several rRNA gene databases before sequencing, and thereby assessing the ability of amplification products to differentiate specific taxa from a particular environment, researchers will be able to better target their experiments to resolve the taxa of greatest interest to their research question. To this end, TaxMan also provides graphical analysis of the taxa that are selected for or against by the selected primer set(s).
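To make the notion of taxonomic coverage concrete, the following minimal sketch computes, for each taxon at a chosen rank, the fraction of reference sequences matched by a primer set. It is not TaxMan's implementation. It assumes that each reference sequence comes with a semicolon-separated lineage string and a flag indicating whether the primers matched; the function name, the record format and the toy records are hypothetical.

```python
from collections import defaultdict


def coverage_by_taxon(records, rank=1):
    """Fraction of reference sequences matched by the primers, per taxon.

    records: iterable of (lineage, matched) pairs, where lineage is a
             semicolon-separated string such as "Bacteria;Firmicutes"
             and matched is True if the primer set hit the sequence.
    rank:    0-based index into the lineage (0 = domain, 1 = phylum, ...).
    Lineages that stop above the requested rank are pooled under
    "unclassified".
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for lineage, matched in records:
        levels = [level.strip() for level in lineage.split(";")]
        taxon = levels[rank] if rank < len(levels) and levels[rank] else "unclassified"
        totals[taxon] += 1
        hits[taxon] += bool(matched)
    return {taxon: hits[taxon] / totals[taxon] for taxon in totals}


# Toy records (hypothetical match results, not real database output).
records = [
    ("Bacteria;Verrucomicrobia", False),
    ("Bacteria;Verrucomicrobia", True),
    ("Bacteria;Firmicutes", True),
]
print(coverage_by_taxon(records))
# {'Verrucomicrobia': 0.5, 'Firmicutes': 1.0}
```

Taxa with a coverage well below that of the rest of the database are the ones a primer set selects against, which is the kind of bias the text describes for Verrucomicrobia.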