Here, we describe the distributions of the microbial sequence counts observed in two studies of the bacterial differences between two groups of samples. The distributions of the relative abundance variables are highly skewed, non-negative and have a large proportion of zeros, for which commonly used statistical approaches may not be appropriate. Three specific taxa from each study were presented in detail to demonstrate the performance of each approach. Based on this analysis, we show that the application of two-part tests provide more information about sequence count data compared to t-tests and Wilcoxon tests. The Wilcoxon and two-part tests produce similar results when there are smaller proportions of zeros in both groups, but as these proportions increase, the Wilcoxon test is less powerful due to the higher number of tied ranks.

In ecological research, count data with a large proportion of zeros is routinely encountered

[25],

[26],

[27]. In fact, in this case, the large proportion of zeros is intrinsic to the creation of the dataset rather than the data generating process itself. The dataset contains sequence counts for taxa that were observed in at least one sample, if a particular taxon was not observed in a sample it is given a zero value. Therefore, when comparing sequence counts across two diverse groups with differences in the presence/absence of taxa, a large numbers of zero counts are expected. For this reason, it is likely that similar distributions are also encountered in other related sequencing applications such as allele frequency. Moreover, application of the two-part test proposed here is not restricted to sequence count data from microbial ecology studies.

The two-part statistic provides an analytic option for sequence count data due to the unique features observed, mainly, data which is non-normally distributed, high dimensional and contains a large proportion of zeros. Further, it performs an explicit test of both the proportion of samples that contain particular taxa and, simultaneously, the relative abundance between two groups. This approach overcomes limitations in other methods like the t- test, which is affected by outliers, and the Wilcoxon rank sum test that accommodates non-normality but loses power as the number of tied ranks, caused by the large number of zero counts, increases.

It has previously been shown that the two-part tests perform better than the other commonly used tests when the group with the larger proportion of zeros also has the larger mean, as demonstrated by the

*Granulicatella adiacens*,

*Veillonella dispar* and

*Chlorobium* examples. If the opposite holds true then the two-part tests have somewhat reduced power with respect to the commonly used methods

[19]. However, the application of the two-part test remains advantageous given the interest in simultaneously comparing the presence/absence and the mean quantities of taxa. Lachenbruch

[19] provides an empirical simulation study which investigated and compared the power and type I error rates of the two-part test with the single degree of freedom tests considered here.

For large samples, the two parts of the two-part statistic (Z and W) are independent under the assumptions of independent errors of both parts of the test

[28]. However, for smaller studies, where the distributional approximations of the two-part test are not reasonably assumed, extension of this method to a permutation based test

[5] is warranted to more accurately estimate a corresponding p-value. In any case, the tests can be ranked based on the chi-squared statistic to similarly perform feature selection in the event a satisfactory approach to calculating a p-value cannot be obtained. The issue of multiple comparisons from application of the two-part test to each individual taxa was not addressed here. The expansion of the two-part test to a more general permutation based test will accommodate the permutation-based multiple comparisons adjustments previously applied to microarray studies. These approaches have the advantage to account for both the correlation between taxa and the association induced by the relative abundance calculation

[29],

[30],

[31]. Additionally, for more complex study designs, the authors are developing zero-inflated models which others have advocated for this type of data

[32] and are useful for generalizations which include ANOVAs, addition of covariates and repeated measures but which require adaptation and guidelines for high-dimensional applications.

To date, there is a fundamental lack of development and investigation of statistical methods appropriate for integrated sequence and metadata that resulted in an analysis bottleneck and backlog of potentially informative studies

[33],

[34],

[35]. There are several published contentions that human microbiome research “lacks the range of computational tools necessary to analyze these sequences in sufficient detail”

[36]. It is also recognized that interpretation of the available sequence data will require integration with relevant environmental, epidemiological and clinical data

[33],

[36]. The commonly used statistical methods applied in this area are intended for the calculation of global ecological parameters and the description of bacterial communities. These methods are not meant to address more focused questions related to specific taxa. The departure from more general inquiries about the overall community differences to analyses that focus on specific taxa is likely where the greatest advancement in knowledge of the human microbiome will come from. This transition is apparent in recent publications

[37],

[38]. To further proceed in this direction, we have proposed an initial strategy for comparisons between two groups and have shown it is appropriate for the specific attributes of microbiome data, irrespective of sample type, phylogenetic level and sequencing technology.