The first model for supragenome (or pan-genome) analysis [
19] was developed using genomic DNA sequence datasets from eight strains of the species
Streptococcus agalactiae, also known as group B
Streptococcus (GBS). This model was developed by fitting an exponential decay function to a plot of the average number of
core genes observed with increasing numbers of genomes examined (where the average was taken for all possible permutations of the order of consideration of the genomes under study), and took the asymptote defined by such a plot as an estimate of the size of the GBS
core genome. This model also fitted a second exponential decay function to a plot of the average number of
new genes observed with increasing numbers of genomes examined (where the average is taken as before), and took the asymptote as an estimate of the number of
new genes that would be observed with each new GBS genome sequenced. Finally, this model also estimated the
size of the GBS pan-genome by deriving a third equation for its rate of growth. A recent review [
31] proposed a revised version of this model that adopts a power law fit (Heaps' Law, from the field of information retrieval) in lieu of the earlier exponential fit of the observed data. In both the original and the power law models, a threshold parameter (α) is used to distinguish
open and
closed pan-genomes, where an open pan-genome (with α ≤ 1) is defined as one that will yield a non-zero number of new genes when each additional genome of the species is sequenced. More recently, with the advent of ever less expensive sequencing technologies, making it possible to sequence scores of independent strains, it could be argued that modeling of the supra/pan-genome is unnecessary since sequencing of additional strains can be continued until no significant number of novel genes are identified [
32]
The probabilistic foundation of the model used in this work [
21] offers a somewhat different perspective, but with the improvements described above that take into account multiple gene frequency classes allows for accurate supragenome modeling of populations/species for which it is not possible to obtain multiple independent clonal lineages (i.e. unculturable organisms) for sequence analysis. In these cases gene frequency were inferred from the different sequence coverage levels observed within the sequenced population. Since the vast majority of bacterial species are not culturable, but are now amenable to whole genome sequencing through single cell isolation and whole genome amplification techniques [
33] our model can be used to estimate the percentage of the supragenome that has been obtained at intermediate coverage levels.
A recent comparison of the pan-genome model of Tettelin et al. with our finite supragenome model demonstrated that the two models make highly similar predictions when provided the same dataset [
26], thus serving as a validation for both. However, both models share fundamental challenges in areas such as the selection of appropriate genomic DNA sequence datasets to use. For example, we chose not to include
S. aureus plasmid DNA sequences, e.g., those associated with the published genomes that were used in our analysis (Table ), nor the DNA sequences of the many
S. aureus bacteriophage genomes that have been published [
34]. However, our analysis included three draft genomes from
S. aureus strains that we sequenced, and these unfinished genomes may contain plasmid-derived contigs. The results in Table in fact suggest that the strain CGSSa01 may contain one or more
S. aureus plasmids. Its genome contains the largest number of genes (2,648), and the 18 unique genes it contains are significantly greater than all but two of the other 17 strains. A comprehensive review of the
S. aureus genome [
35] provides a detailed description of some of the many plasmids and other mobile elements that it may contain. Decisions about the inclusion or exclusion of published plasmid and bacteriophage DNA sequences in a supragenome analysis can therefore lead to systematic error in estimates of counts of different classes of genes (e.g., core and unique genes).
Other issues arise during the selection of appropriate genomic DNA sequence datasets to use in a supragenome analysis. In this work we included genomes that are known to be very closely related, e.g., those of strains JH1 and JH9 [
28], as well as one known outlier genome (RF122) of bovine origin [
36]. These decisions can also be criticized as leading to systematic error in estimates of counts of different classes of genes. However, inclusion of a limited number of closely related and outlier strains also provide for useful internal controls for the results of the supragenome analysis. In some respects, and especially for a species such as
S. aureus, bias in the selection of genomic DNA sequence datasets to use in a supragenome analysis is unavoidable. Given the intense interest in clinically relevant strains of
S. aureus, one can reasonably expect that even with the ever increasing affordability and subsequent availability of bacterial genomic DNA sequence datasets, the
S. aureus strains selected for sequencing will for the foreseeable future be dominated by epidemiologically important clinical strains, and that the much more quantitatively representative commensal
S. aureus stains that could be isolated from human subjects will be under-represented in supragenome and other comparative genomic analyses.
The
S. aureus core genome has been found to contain a heptameric DNA sequence (GAAGCGG) that is believed to protect it from uncontrolled rearrangements [
37]. This conserved
crossover hotspot initiator or
chi site is not the only DNA sequence motif and associated nucleic acid information processing system with a putative influence on the structure and maintenance of the
S. aureus supragenome. The
Sau1 Type I restriction-modification (RM) system has at least two important influences in this regard [
38]. First, it reduces the efficiency of conjugation between
S. aureus and enterococci, the putative source of vancomycin resistance. This is believed to explain why very few vancomycin-resistant
S. aureus strains have arisen, despite tremendous selective pressure acting on the bacterial flora of patients treated with this drug [
39]. Second, the
Sau1 RM system's multi-copy specificity gene,
sau1hsdS, has many alleles with significant population frequencies, and these alleles correspond to the major
S. aureus lineages. Five copies of this gene occur in the RF122 genome; three copies in the genomes of CGSSa01, MSSA476, USA300, and USA300TCH15, and two copies in the remaining genomes we examined (i.e., it appears to be a
multi-copy core gene found in all genomes examined). The
Sau1 RM system therefore not only controls horizontal gene flow into
S. aureus from other species, but also within
S. aureus lineages via the polymorphic RM specificity alleles of
sau1hsdS loci [
38].
Analysis of the supragenome of
S. aureus isolates from antibiotic-naïve populations would be an interesting extension of this work. Paradoxically, antibiotic treatment increases
S. aureus conjugation frequency [
40]; induces
S. aureus temperate phage to excise, replicate, and transfer pathogenicity islands [
41]; and when used in combination therapies may unexpectedly increase the spread of resistance among
S. aureus strains [
42]. In some bacterial genomes the selection pressure exerted by antibiotic exposure may also have the unexpected effect of promoting multi-drug resistance due to
positive epistasis amongst combinations of alleles of antibiotic resistance loci [
43]. Thus, one might expect that
S. aureus isolates from antibiotic-naïve populations would yield an estimate of supragenome size smaller than that reported here, and be comprised of an even larger percentage of core genes. The results of a supragenome analysis may therefore represent an aggregate of the results of environmental niche-specific supragenomes affected by extrinsic agents such as antibiotics that modulate horizontal gene transfer, as well as regulatory allele-specific supragenomes affected by intrinsic genetic phenomena (such as the
Sal1 system) that also modulate horizontal gene transfer.
Another interesting extension of this work would be an analysis of the supragenome of the related species
Staphylococcus sciuri, which is one of the most abundant staphylococcal species and a frequent epidermal commensal of animals [
44]. The
mecA gene found in methicillin-resistant
S. aureus (MRSA) strains encodes the PBP2A penicillin-binding protein, whose affinity for β-lactam antibiotics acts as a sink that vitiates the efficacy of these drugs and protects native
S. aureus PBPs during their function as bacterial cell wall synthetic enzymes [
45]. Incorporation of the
mecA gene into the
S. aureus genome is an unusual event, and requires both a delivery entity called the staphylococcal chromosome cassette (SCC), and a suitable but rarely encountered
S. aureus genetic background that can tolerate the presence and expression of the
mecA gene [
45].
Staphylococcus sciuri, although susceptible to β-lactam antibiotics, is believed to be the source of the precursor homolog of the
mecA gene present in the limited number of MRSA strains that have emerged worldwide [
45]. Although limited in number, MRSA strains have spread in an epidemic manner with devastating clinical consequences. The
S. sciuri genome appears to be ubiquitously agreeable to the presence and expression of its
mecA precursor homolog [
44], and hence a study of the
S. sciuri supragenome may yield insights into the genetic determinants whose homologs in, or horizontal acquisition by, a
S. aureus genome may predispose to the acquisition of
mecA.