The use of a binomial mixture model for estimating the pan-genome size was introduced by [7], but the use of mixture models for population size estimation is by no means new, e.g. [8, 10, 16]. The estimation of a population size has a long history in ecology, under the name of capture-recapture problems (e.g. [17]), and in epidemiology, where it is known as multiple record systems (e.g. [18]). Mixture models are suitable when we are faced with a larger number of recaptures/records/genomes and heterogeneous detection probabilities, which is exactly the case for pan-genomics.
From our results in Figure we notice that for none of the species does the optimal mixture model have 2 components. Two components would be expected if the gene pool could be divided into core genes and dispensable genes, as implicitly assumed by [2, 6]. There is always at least a third group, and frequently even more. This observation agrees with the results of [19], who find that for bacteria and archaea in general, genes can be divided into three classes: core (always occurring), shell (moderately occurring) and cloud (rarely occurring).
One reason for this heterogeneity in detection probabilities may be skewed sampling. If some of the sequenced genomes are sampled from the same "corner" of the population, the genes characteristic of this "corner" will occur more frequently than they should. Another reason may be that some genes simply occur frequently in the population, reflecting divergence from a fairly recent ancestor. From this perspective, we must expect a large number of distinct true detection probabilities, which is at least partly supported by the fact that the more genomes we consider, the more components we estimate (see Figure ).
The fact that microbial genomic diversity is caused by both vertical mutations and horizontal transfer also makes it plausible to expect heterogeneous detection probabilities.
From Figure we also see that even for 22 genomes (E. coli) we only estimate 6 components. In [7] a mixture of 7 components was used for a data set of 8 genomes, which seems to be too complex a model. Using too complex mixture models will tend to over-estimate the pan-genome size, since it makes the estimate of the smallest detection probability artificially small.
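The link between the smallest detection probability and the size estimate can be illustrated with a minimal sketch. It assumes the usual zero-truncated form of a binomial mixture estimator, in which the number of observed gene families is divided by the probability of a family being seen in at least one genome. The function name, the mixing proportions and the counts are hypothetical and chosen for illustration only; they are not values from this study, and the mixing proportions are held fixed here although a real fit would re-estimate them.

```python
def estimated_pan_size(n_observed, weights, detect_probs, n_genomes):
    """Zero-truncated binomial mixture estimate of the total number of gene families.

    n_observed   : gene families seen in at least one genome (sample pan-genome size)
    weights      : mixing proportions of the components (must sum to 1)
    detect_probs : detection probability of each component
    n_genomes    : number of sequenced genomes
    """
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixing proportions must sum to 1")
    # Probability that a gene family is absent from all n_genomes genomes.
    p_unseen = sum(w * (1.0 - p) ** n_genomes
                   for w, p in zip(weights, detect_probs))
    # Scale the observed count up by the probability of being observed at all.
    return n_observed / (1.0 - p_unseen)


if __name__ == "__main__":
    # Hypothetical two-component example: a core component (probability 1.0)
    # and a rare component whose detection probability is varied.
    for smallest in (0.10, 0.05, 0.01):
        size = estimated_pan_size(5000, [0.6, 0.4], [1.0, smallest], n_genomes=8)
        print(f"smallest detection probability {smallest:.2f}: "
              f"estimated pan-genome size {size:.0f}")
```

With everything else held fixed, lowering the smallest detection probability increases the probability of a family being unseen and therefore inflates the estimate, which is the mechanism described above.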
In Figure we see that a larger sample pan-genome tends to result in a larger estimated pan-genome.
This is due to the fact that larger data sets allow more complex models, and more complex models allow more extreme estimates. Uncertainties, as indicated by the rough confidence intervals, also tend to grow when estimates grow, which is reasonable.
In Figure we have constructed a way to plot the estimated mixture models for comparative pan-genomics. In this picture the actual size of the core- and pan-genome is not visible; instead we focus on the relative distribution of detection probabilities. Some species have a large proportion of stable genes (blue area), while at the other end of the scale we find those with little overlap between genomes. A larger number of components indicates a more complex pan-genome with respect to heterogeneity in detection probabilities.
From the results in Figure we can compute the coverage for each species, which is simply the size of the sample pan-genome divided by the estimated pan-genome size. Ideally, we should expect this to increase as the number of genomes increases, because the sample pan-genome size should approach the true pan-genome size. There is no such tendency in our results. We even observe that two of the largest data sets (S. enterica and E. coli) have two of the smallest coverages. Figure also clearly demonstrates that, at least for E. coli, the pan-genome estimates get even higher as more genomes become available. This is typical for a population with a large fraction of ORFans. Since ORFans have a small detection probability, only a few of them will show up in any given genome. Hence, it requires a substantial number of genomes before we can estimate their true abundance. From this perspective, the binomial mixture model will tend to under-estimate the true pan-genome size for smaller data sets.
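The coverage defined above, and the reason rare genes need many genomes to be seen, can be sketched in a few lines. The function names and all numbers below are hypothetical illustrations, not results from this study; the second function assumes that a gene is present in each genome independently with the same detection probability.

```python
def coverage(sample_pan_size, estimated_pan_size):
    """Fraction of the estimated pan-genome already present in the sample."""
    return sample_pan_size / estimated_pan_size


def prob_seen(detect_prob, n_genomes):
    """Probability that a gene occurs in at least one of n_genomes genomes,
    assuming independent presence with probability detect_prob per genome."""
    return 1.0 - (1.0 - detect_prob) ** n_genomes


if __name__ == "__main__":
    # Hypothetical sizes: 3000 observed families, 4000 estimated in total.
    print(f"coverage: {coverage(3000, 4000):.2f}")

    # A hypothetical ORFan-like gene with detection probability 0.02.
    for g in (5, 10, 22, 50):
        print(f"{g:>2} genomes: probability of being seen {prob_seen(0.02, g):.2f}")
```

Under these assumptions a gene with a detection probability of a few percent is still more likely to be missed than seen with a couple of dozen genomes, so the rare component is poorly represented in small data sets.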
In Table we show that possible false positive gene predictions affect the estimates of pan-genome size. By removing hypothetical proteins from the data set, the number of ORFans drops. This in turn leads to a decreased pan-genome size estimate. Predicting new genes with Easygene gives the largest reduction in ORFans, but the effect on the pan-genome size estimated by the mixture model is smaller. This is because the mixture model depends on the entire data distribution, not only on the ORFans.
Our approach assumes a closed pan-genome, i.e. η is a parameter. In an open pan-genome, the total number of genes is not fixed, and in a very long-term perspective this is most likely the case, assuming new genes form and old genes disappear. However, in a reasonably short time window, the number of genes available to any population must be limited, and can be assumed constant. Whether genes are shared vertically or horizontally within the population should have no impact on the closedness of the gene pool.
A recent publication [20] has suggested alternative ways of estimating pan-genome size, based on power laws and regression. Our more probabilistic approach is fundamentally different, and more in line with existing methods in capture-recapture modelling. However, as suggested by the results in Table , a major problem in pan-genome size estimation is that the data themselves are estimates, and thus the uncertainty in the computation of gene families will influence the results, sometimes severely. In order to improve the estimation of bacterial genomic diversity, future efforts should probably be focused on this aspect.