All the parameters explained in the previous section are computed and included in the output table called "Log Likelihood - Estimated Parameters." This table lists, for each sample: the number of sequences, the mean and maximum pairwise HD, the mean of the best-fitting Poisson distribution, the corresponding time since the MRCA, and the goodness-of-fit P-value. Note that when the sample meets our model's assumptions, the mean of the best-fitting Poisson distribution is in fact the mean pairwise HD of the sample. A second table, called "Convolution Estimates," provides the observed HD frequencies and the estimated ones calculated using equation (1). A more detailed explanation of the parameters is provided in the Explanation file on the tool web page.
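The per-sample summary statistics can be illustrated with a minimal Python sketch. It is not the tool's implementation: the function names and the toy alignment are hypothetical, gaps and ambiguous bases are ignored, and the Poisson mean is simply set to the mean pairwise HD, which the text above says holds when the model assumptions are met.

```python
from collections import Counter
from itertools import combinations
from math import exp, factorial


def hamming(a, b):
    """Number of aligned positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def pairwise_hd_summary(seqs):
    """Return (counts, mean_hd, max_hd) over all unordered sequence pairs."""
    counts = Counter(hamming(a, b) for a, b in combinations(seqs, 2))
    n_pairs = sum(counts.values())
    mean_hd = sum(d * c for d, c in counts.items()) / n_pairs
    return counts, mean_hd, max(counts)


def poisson_expected(mean_hd, max_hd, n_pairs):
    """Expected pair counts at distances 0..max_hd under Poisson(mean_hd)."""
    return [
        n_pairs * exp(-mean_hd) * mean_hd ** d / factorial(d)
        for d in range(max_hd + 1)
    ]


# Hypothetical toy alignment, for illustration only.
toy = ["ACGTACGT", "ACGTACGA", "ACGTACGT", "ACCTACGT"]
counts, mean_hd, max_hd = pairwise_hd_summary(toy)
expected = poisson_expected(mean_hd, max_hd, sum(counts.values()))
```

For this toy alignment the six pairs give one pair at distance 0, four at distance 1, and one at distance 2, so the mean pairwise HD is 1.0 and the maximum is 2.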
Figure 2 shows the graphics obtained by analyzing a fragment of the NEF HIV-1 gene (169 base pairs) from patient CH40 [3]. The data have been published [16] and submitted to the NCBI Sequence Read Archive (http://www.ncbi.nlm.nih.gov/Traces/sra) under accession number SRA020793. This sample was obtained through deep sequencing [6] and yielded a little over 4,000 sequences. Our tool can easily handle ten times as many: because it works only with counts of pairwise distances, it can process samples of almost any reasonable size, although very large jobs will slow the server. The left panel in Figure 2 shows, on a logarithmic scale, the pairwise HD frequency counts (black), the best-fitting Poisson distribution (blue), and the counts expected under a star phylogeny (red). The red and black lines are indistinguishable, which confirms that the sample follows a star-like phylogeny. Because the Poisson fit is very sensitive to deviations in the upper tail of the distribution, the tool outputs graphics on a logarithmic scale whenever the sample size is above 100; this helps visualize possible deviations at the higher distances. Although the values are discrete, lines are used for better visualization. The right panel shows a histogram of the frequency counts together with the best-fitting Poisson distribution (red). In this case the sample yielded a good fit (P = 0.981) and a time to MRCA of 34 days, 95% CI = (31, 38).
Figure 2. Example of output graphics for a 454 sample that conformed to the model. Pairwise HD frequency plots on a logarithmic scale (black, left panel), together with the best fitting Poisson (blue) and the theoretical counts expected if the sample were to follow ...
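The scaling remark above can be made concrete with a small arithmetic sketch. The helper names are hypothetical, "sample size" is assumed to mean the number of sequences, and 4,000 is used as a round figure for the sample described in the text.

```python
def n_pairs(n_seqs):
    """Number of unordered sequence pairs, n(n-1)/2."""
    return n_seqs * (n_seqs - 1) // 2


def plot_scale(n_seqs, threshold=100):
    """Scale used for the frequency plot: logarithmic above the threshold.

    Assumption: the threshold of 100 applies to the number of sequences.
    """
    return "log" if n_seqs > threshold else "linear"


pairs_4k = n_pairs(4_000)    # 7,998,000 pairwise comparisons
pairs_40k = n_pairs(40_000)  # 799,980,000 pairwise comparisons
scale_454 = plot_scale(4_000)
scale_sga = plot_scale(50)
```

The number of pairs grows quadratically with the number of sequences. The histogram of pairwise distances, by contrast, needs only one bin per observed distance (at most 170 bins for a 169-base-pair fragment), so only the counting step grows with sample size.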
As a second example, Figure 3 shows a sample drawn from single genome amplification sequencing [3]. All fifty sequences used for this example are available through Accession Numbers EU575084-133. In this case the original alignment does not yield a good fit to the Poisson distribution (top left panel, red line), but the tool detected APOBEC3G/F-mediated hypermutation. Selecting the option to correct for APOBEC signatures by both methods produces two more alignments: one from which two significantly (P < 0.1) hypermutated sequences are removed, and one from which the alignment positions where APOBEC-induced mutations could potentially affect the results are removed instead. It is noteworthy that the first type of correction still does not yield a good Poisson fit (GOF P < 0.0001). This is because the sample is overall enriched for G-to-A mutations. One can check this by looking at the Hypermut Results Table and noticing that the sequence called compressedMutations yields P = 6 × 10⁻⁶ for APOBEC enrichment. Therefore, only when all positions with a G-to-A mutation embedded in an APOBEC3G/F motif are removed does one finally achieve a good Poisson fit (P = 0.865) and a biologically sensible (given the clinical data available for this subject) estimate of the time since the MRCA: 12 days, 95% CI = (8, 16). This example illustrates how APOBEC enrichment can cause the Poisson fit to fail, and hence why the APOBEC-induced mutations must be isolated in order to make sensible estimates of the timing of the infection.
Figure 3. Example of output graphics for an SGA sample that was enriched for APOBEC mediated substitutions. HD frequency plots with best fitting Poisson (red line), on the left (panels A and C), and with theoretical star-phylogeny frequencies (red line), on the ...
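The second correction, removing alignment positions that could carry APOBEC-induced mutations, can be sketched as follows. This is an illustration under stated assumptions, not the tool's code: the motif is taken to be a G followed by A or G and then by A, G, or T (the downstream context commonly used for APOBEC3G/F), the context is evaluated on the majority consensus rather than on each sequence, and all function names and the toy alignment are hypothetical.

```python
from collections import Counter


def consensus(seqs):
    """Majority base at each alignment column (ties broken by first seen)."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))


def apobec_columns(seqs, ref):
    """Columns where ref has G in an assumed G-R-D context and some sequence has A."""
    cols = set()
    for i in range(len(ref) - 2):
        in_motif = ref[i] == "G" and ref[i + 1] in "AG" and ref[i + 2] in "AGT"
        if in_motif and any(s[i] == "A" for s in seqs):
            cols.add(i)
    return cols


def drop_columns(seqs, cols):
    """Return the alignment with the given columns removed."""
    return ["".join(ch for i, ch in enumerate(s) if i not in cols) for s in seqs]


# Hypothetical toy alignment: a G-to-A change at column 1 (motif context)
# and another at column 6 (no motif context).
toy = ["TGGATCGC", "TAGATCGC", "TGGATCGC", "TGGATCAC"]
ref = consensus(toy)
cols = apobec_columns(toy, ref)
corrected = drop_columns(toy, cols)
```

In the toy alignment only column 1 is removed. The G-to-A change at column 6 is kept because it is not in the assumed motif context, so it still contributes to the pairwise distances.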
Unlike the example in Figure 2, where a logarithmic scale is used for better visualization, when the sample size is under 100 the star-phylogeny comparison is represented as in the right-hand panels: the observed pairwise HD frequency counts are shown by the blue histograms, whereas the theoretically computed ones are shown in red. For both the APOBEC-corrected and the uncorrected samples, the red lines follow the histograms faithfully, which indicates that both samples have star-like phylogenies.
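The star-phylogeny comparison can be sketched in Python under one explicit assumption: that the theoretical counts are obtained by convolving the distribution of distances to the founder sequence with itself, as the name of the "Convolution Estimates" table suggests. Equation (1) is not reproduced here, so this is a hedged illustration rather than the tool's formula; the function names and the toy data are hypothetical.

```python
from collections import Counter


def hamming(a, b):
    """Number of aligned positions at which two sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def star_expected_counts(seqs, founder):
    """Expected pairwise HD counts if lineages diverge independently from founder.

    Assumption: pairwise HD distribution = self-convolution of the
    distribution of HDs from the founder (consensus) sequence.
    """
    n = len(seqs)
    d0 = Counter(hamming(s, founder) for s in seqs)
    p = [d0.get(k, 0) / n for k in range(max(d0) + 1)]
    conv = [0.0] * (2 * len(p) - 1)
    for i, pi in enumerate(p):
        for j, pj in enumerate(p):
            conv[i + j] += pi * pj
    pairs = n * (n - 1) // 2
    return [pairs * q for q in conv]


# Hypothetical toy alignment; the founder is taken to be its consensus.
toy = ["ACGTACGT", "ACGTACGA", "ACGTACGT", "ACCTACGT"]
star = star_expected_counts(toy, founder="ACGTACGT")
```

For the toy data the expected counts at distances 0, 1, and 2 are 1.5, 3.0, and 1.5, against observed counts of 1, 4, and 1. Such a small sample gives only a rough agreement; for real samples the two curves would be compared as in the figure panels.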
Both of the examples above meet our model's assumptions of exponential growth with no selection and a negligible recombination rate. When one or more assumptions are not met, the goodness-of-fit P-value drops considerably and the estimated time since the MRCA is therefore inaccurate. Several factors can cause this: for instance, the infection may be non-homogeneous, the sample may not be "early" enough, or one may have sampled an unlikely early random mutation that distorts the Poisson distribution. When analyzing HIV-1 data, we recommend using samples taken within the first 2-5 weeks of infection, or characterized as Fiebig stage I or II [3]. At later Fiebig stages selection and recombination are clearly observed, and the diversity is controlled by these later selective bottlenecks. The probability of an early stochastic mutation violating the model assumption is calculated in [14], and is typically small.
Finally, we note that our tool can be applied to subsets of sequences sampled at later time points when there is evidence of a narrow bottleneck. For example, in Fischer et al. [16] we were able to isolate the escape lineages after the immune response had begun and applied the tool to estimate the timing of each lineage. The tool has been used primarily on large HIV-1 data sets [3], though it can be used on any population that grows in a similar fashion, as appears to be the case for HCV, for instance [21