Identifying 16S sequence
The first stage of our analysis matches reads against known 16S rDNA sequences, or finds the closest matches to known organisms. We leverage the Ribosomal Database Project
release 9 update 39 for its catalog of bacteria and their phylogenetic relationships [19
] and the prokMSA
database as a representative set of archaeal sequence [20
]. For each read, we used the tool BLAT
(BLAST-Like Alignment Tool) [18
] to quickly identify matches between reads and the combined bacterial/archaeal database, using a minimum identity of 90% and a minimum match/mismatch score of 15 bases.
To score the homology between a read and a database sequence, we approximate the probability that the read came from an organism with a p = 98% sequence similarity to the given read as follows:
where Mi are indicator variables that are 1 if position i in the read matches with the database sequence in their alignment and 0 otherwise. The variables Qi are the probabilities that the read bases were called correctly, derived from the sequence quality scores.
Then, to judge whether or not we believe a read came from the organism's phylotype, we compare this probability against the probability for a hypothetical read that falls at the boundary of similarity
P (related limit) = ppL (1 - p)(1 - p) L.
Reads that score above this probability limit are classified as known, while reads that score below the limit are classified as unknown.
Placing reads in the phylogeny
Each read that results in a known match will typically also match with many additional organisms. For example, a read may match with several species within the same genus, in which case we cannot identify the exact species of the read. However, if all the known hits at least fall within the same genus then we are confident the read was sampled from an organism belonging to that genus. In this way, for each read our goal is to determine the most specific classification within the phylogeny that likely contains the organism from which the read was obtained.
We analyze each read r
using the following alrogithm. For every node i
in the phylogenetic tree, we assign a value B
) as follows. For leaf nodes, we set B
) = P
related to i
) defined above for organisms with a scored BLAT hit, and B
) = 0 otherwise. For internal nodes, we set B
) = maxj children (i) B
). This process is illustrated in Figure . At the root node we will therefore have B
(root) = maxi P
related to i
Figure 8 Placing read in phylogeny. Computing the most specific and confident placement of a read in the phylogenetic tree occurs in two stages. First, for each internal node we compute a score that is equal to the maximum of the scores of its children. Second, (more ...)
Next, we traverse down the tree starting at the root node. At each internal node, if the ratio of the two maximal child nodes j and k exceeds a threshold T (i.e. B(j)/B(k) > T), or if B(j) is the only non-zero child, then we descend to node j and repeat the procedure. Once the procedure terminates in an internal node or a leaf node i, we believe with a confidence level related to T that the read came from an organism belonging to the subtree rooted at i. An example of this is illustrated in Figure . We experimented with the choice of T over several orders of magnitude and found that the resulting analysis varied only very slightly. For the analyses performed in this study, we used T = 0.01.
For our analysis of read-length and variable region resolving power, we simulated reads from two hypothetical collections of species. The random profile consists of 387 species of bacteria, selected by randomly traversing down the phylogenetic tree from the root to a leaf, picking each branch with uniform probability, resulting in very high diversity. The sample profile constitutes 330 species of bacteria, created by sampling species from a distribution of genera and species that was consistent with the analysis results from our six samples.
For each simulation, a read simulator generated reads sampled from 16S rDNA sequence selected randomly according to either the random or the sample profile. Reads were sampled with uniform probability from across the rDNA sequence, with a read-length drawn from a Gaussian distribution with average read-lengths of 30, 60, 100, 150, 200, 400, or 800 bp and standard deviations of 20%. Sequencing errors were introduced into the reads at a rate of 1% that consisted of mutations, insertions, deletions, and homopolymer run count errors characteristic of Pyrosequencing.
To understand the effect of read-length on the resolving power of our methodology, we simulated reads from both the random and sample profile with the seven read-lengths L of 30 – 800 bp. For each of the 2 × 7 = 14 parameter sets we produced 30 Mb of simulated read data (N reads @ L bp = 30 · 106 bp) and applied our analysis. By annotating the source species for each read we able to measure the accuracy of its placement in the phylogenetic tree, as in Figure .
To study the effectiveness of restricting the sequencing to known variable regions, we first selected a set of eight minimally-overlapping regions within the 16S rDNA sequence: seven regions each contained one of the known variable regions V1 – V6 and V9, and one region contained both V7 and V8 [27
]. These regions are listed in Table with their E. coli
16S rDNA sequence coordinate ranges as well as 5' and 3' broad-range amplification primers, which we validated by PCR amplifying 16 rDNA from E. coli
. For each primer pair we performed 15 cycles of touch-down PCR, going from 94°C for 30 s, to an annealing temperature ranging from 70°C to 50°C for 30 s, and finally extending at 72°C for 30 s. We then performed 30 additional cycles at 94°C for 45 s, 50°C for 45 s, and 72°C for 45 s, and verified the resulting products via gel electrophoresis.
Next, we again produced sets of simulated reads for both the random and the sample profile, restricted to each region, for a total of 2 × 8 = 16 sets. Each read data set consisted of 300,000 reads with an average read-length of 100 bp and a 1% error rate. We applied our analysis and measured the accuracy of read placement in the phylogeny in Figure .