Phage diversity analyses represent a new level of population diversity beyond what is encountered in other areas of microbial ecology. We illustrate the application of CatchAll to a contig spectrum from a swine fecal metagenome (Allen et al., 2011). The contig spectrum was generated using Circonspect via the CAMERA pipeline (Sun et al., 2011). The complete dataset is [(1,4736), (2,521), (3,152), (4,69), (5,46), (6,27), (7,21), (8,18), (9,16), (10,10), (11,9), (12,8), (13,7), (14,6), (15,5), (16,4), (17,4), (18,3), (19,3), (20,3), (21,3), (22,2), (23,2), (24,3), (25,3), (26,1), (27,2), (28,1), (29,2), (30,2), (31,1), (32,1), (33,1), (34,1), (35,1), (36,1), (37,1), (38,1), (39,1), (40,1), (41,1), (42,0), (43,1), (44,0), (45,1), (46,0), (47,0), (48,0), (49,0), (50,0), (51,0), (52,1)]. CatchAll output (slightly abbreviated here) as displayed in the GUI screen or equivalently in the ‘Best Models Analysis’ file is shown in .
This analysis took 309 s in GUI mode on a 3 GHz/8 MB RAM 64-bit notebook PC. Computation time depends on the complexity (in particular, the smoothness) of the frequency count data, not on the original sample size, because the original sequence data are reduced to frequency counts before analysis.
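The reduction to frequency counts can be illustrated with a minimal Python sketch. It assumes per-taxon abundance data; the function name `frequency_counts` and the toy abundances are hypothetical and are not part of CatchAll. The contig spectrum analyzed here was produced by Circonspect rather than by this kind of tally, but the resulting (j, f_j) format is the same.

```python
from collections import Counter


def frequency_counts(abundances):
    """Return sorted (j, f_j) pairs, where f_j is the number of taxa
    observed exactly j times. Unobserved taxa (abundance 0) are dropped."""
    tally = Counter(a for a in abundances if a > 0)
    return sorted(tally.items())


# Hypothetical toy data: one abundance value per observed taxon.
abundances = [1, 1, 1, 2, 2, 5, 1, 3, 1, 2]
counts = frequency_counts(abundances)
print(counts)  # [(1, 5), (2, 3), (3, 1), (5, 1)]
```

However many reads contribute to the abundances, the estimator only sees this short list of pairs, which is why run time tracks the shape of the counts rather than the sample size.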
In this case, the best fitted parametric model and its first two alternatives (2a and 2b) are the same, and the third alternative (2c) is very close. The various analyses agree approximately at optimal τ, with Chao1 serving as a lower bound, while some anomalies are seen at max τ, as expected. In particular, ACE and ACE1 should only be used for τ≤≈10; the value of Non-P τmax is displayed only for comparative purposes.
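As a rough check on the lower-bound role of Chao1, the classical form of the estimator, S_obs + f1²/(2 f2), can be computed directly from the frequency counts listed above. The following is a minimal sketch and not CatchAll's own code; the function name `chao1` is hypothetical, and CatchAll may report a different variant (for example, a bias-corrected form), so its displayed value need not match exactly.

```python
# Frequency counts (j, f_j) copied from the contig spectrum given in the text.
spectrum = [
    (1, 4736), (2, 521), (3, 152), (4, 69), (5, 46), (6, 27), (7, 21),
    (8, 18), (9, 16), (10, 10), (11, 9), (12, 8), (13, 7), (14, 6),
    (15, 5), (16, 4), (17, 4), (18, 3), (19, 3), (20, 3), (21, 3),
    (22, 2), (23, 2), (24, 3), (25, 3), (26, 1), (27, 2), (28, 1),
    (29, 2), (30, 2), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1),
    (36, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 1), (42, 0),
    (43, 1), (44, 0), (45, 1), (46, 0), (47, 0), (48, 0), (49, 0),
    (50, 0), (51, 0), (52, 1),
]


def chao1(pairs):
    """Classical Chao1: S_obs + f1^2 / (2 * f2).

    Assumes f2 > 0; other variants handle f2 == 0 differently.
    """
    freq = dict(pairs)
    s_obs = sum(freq.values())   # number of distinct classes observed
    f1 = freq.get(1, 0)          # singletons
    f2 = freq.get(2, 0)          # doubletons
    if f2 == 0:
        raise ValueError("classical Chao1 is undefined when f2 == 0")
    return s_obs + f1 * f1 / (2 * f2)


s_obs = sum(f for _, f in spectrum)
estimate = chao1(spectrum)
print(s_obs)            # 5703
print(round(estimate))  # 27229
```

Under these assumptions the estimate of roughly 27 229 is well above the 5703 observed classes and below the parametric estimate of 67 792 quoted below, which is consistent with Chao1 acting as a lower bound.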
CatchAll selects the log-transformed version of the weighted linear regression model (WLRM) at τ=5, still agreeing with the other analyses albeit with a larger SE. This demonstrates the robustness of the WLRM, since it is theoretically optimal for data with lower diversity than our phage example.
The best discounted model steps down from a three- to a two-component mixture, and reduces the estimated total diversity by 97.4%, from 67 792 (SE 8656) to 1727 (SE 221). At present, there is no formal statistical hypothesis test for choosing between the original and the discounted models, so the choice depends on the investigator's level of confidence in the low-frequency counts. This is a topic of current research.