emm genotype markers for tissue site preference for infection.
Based on knowledge of emm
type, the emm
pattern genotype can be inferred with reasonable accuracy because organisms sharing an emm
type are very rarely associated with different emm
pattern genotypes (48
). Figure S2 in the supplemental material provides a comprehensive summary of recent epidemiological surveys (1980 to 2009) for population-based collections of GAS causing superficial throat or skin infections (10
pattern genotype is inferred based on emm
type, and the percentage of total isolates corresponding to each emm
pattern genotype is shown. This meta-analysis shows that as a group, GAS strains with the emm
pattern A-to-C (herein designated “A-C”) genotype are strongly associated with throat infections (t
< 0.0001; paired t
test, two-tailed) (see Fig. S2) and therefore designated “throat specialists.” Overall, pattern A-C strains account for 46.2% of pharyngitis isolates but only 8.4% of impetigo isolates. In sharp contrast, the emm
pattern D strains account for 50.4% of impetigo isolates, but only 1.8% of pharyngitis isolates (t
< 0.0001); they are designated “skin specialists.” Strains of emm
pattern E genotype account for almost equal fractions of superficial throat and skin infections (45.4 and 47.8%, respectively); as a group, they are designated “generalists.”
The global epidemiology meta-analysis supports the conclusion that GAS strains harboring the emm
pattern D genotype are far better adapted to occupying the superficial skin infection niche than are pattern A-C strains, confirming earlier findings based on fewer strain collections (12
). Similarly, strains of the emm
pattern A-C genotype are better equipped to fill the superficial throat infection niche than are pattern D strains. The emm
pattern E strain group is equally well adapted to both niches of infected host tissue. Thus, the emm
pattern genotype is a valuable biomarker for tissue tropisms resulting in infection and is used throughout this report.
Complete genome sequence of a classical skin strain.
Finished (gap-free) genome sequences are available for 13 GAS strains representing 10 emm
types, all of which are assigned to the emm
pattern A-C or E genotype (3
). As a first step toward gaining further insight into the genetic basis for tissue tropisms, a finished genome sequence was determined for the emm
pattern D genotype strain, Alab49 (emm
type 53). The genome of Alab49 is 1.827 Mb, which is close in size to the other finished GAS genomes; there are 1,804 predicted protein-coding genes. Despite being the sole emm
pattern D strain among GAS with sequenced genomes, Alab49 has only 17 unique genes based on a combined 80% identity and 80% coverage threshold value applied to BLASTn results (data not shown).
Identification of GAS accessory genes by CGH.
This study specifically seeks to identify GAS noncore accessory genes that may play a critical role in establishing tissue site preferences for infection. To achieve this goal, the Alab49 reference strain plus 96 genetically diverse test strains were subject to comparative genome hybridization (CGH) to a GAS pangenome microarray. Represented among the set of 97 GAS strains is a wide spectrum of emm types and geographic places of isolation (see Table S1 in the supplemental material). Twelve isolates whose complete genome sequence is established are included; they also serve as controls for CGH. The set of 97 GAS strains is represented by 94 emm types and 95 sequence types (STs), defined by MLST of seven housekeeping loci (see Table S1); GAS strains were recovered from 18 countries and all six major continents over a time frame spanning >60 years. Both highly prevalent and rare emm types are included; each strain is given equal weight in the long-term evolution analysis that follows, because strains of emm types having low prevalence in recent decades may be the dwindling progeny of major clones of previous eras or alternatively, newly emerged or reemerged clones. Of the 97 GAS strains in the set, 20 (20.6%) have the emm pattern A-C genotype, 33 (34%) have the emm pattern D genotype, and 44 (45.4%) have the emm pattern E genotype (see Table S1).
The gene content of the 97 diverse GAS strains was determined. The amplicon-based microarray used for CGH gives complete coverage of the GAS pangenome, including targets corresponding to the newly sequenced genome of Alab49, plus multiple alleles of highly divergent loci. Of the 3,388 targets that were spotted onto the microarray, 2,440 (72%) are based on Alab49 sequences. Overall, 3,366 (99%) targets yielded CGH data that are reliable. A finding of weak or no hybridization of genomic DNA to the microarray signifies either gene absence or extensive sequence divergence. For Alab49-based targets, a mean hybridization signal ratio of >2.8 (i.e., present in the Alab49 strain hybridized as a reference and absent from the test strain) was used to define gene presence; for non-Alab49-based targets, a mean hybridization signal ratio of <0.5 (i.e., absent in Alab49, present in the test) was used to define gene absence (see Fig. S1 in the supplemental material).
An accessory gene that confers the tissue specificity phenotype via a mechanism used by many strains should be neither too rare nor too common in its distribution among the set of 97 GAS strains; thus, very-low- and very-high-frequency genes are excluded from further analysis. Spots on the microarray having a presence ranging from 15 to 85% among the set of 97 strains led to identification of 393 differentially distributed targets (for the data set, see Table S2 in the supplemental material). Thus, genes corresponding to the 393 targets are a subset of accessory genes and exclude genes that are either rare (present in <15% of strains) or nearly as common as core genes (present in >85% of strains).
CGH findings were validated using PCR amplification of 16 genes with a differential distribution among GAS strains. The CGH findings on the presence/absence of a gene are ≥94% concordant with PCR-based findings for 13 of 16 genes and ≥85% concordant for 15 of 16 genes (data not shown). Thus, the CGH and PCR-based findings are highly concordant, even though the oligonucleotide primers used in PCR amplification do not always overlap with the amplicon spotted onto the microarray, and several genes have high sequence heterogeneity. Although the CGH findings are highly reliable for most genes, the signal ratio threshold for assigning presence or absence may be too stringent for some genes. Nonetheless, the stringent cutoff values employed, as established in Fig. S1 in the supplemental material, reduce the number of accessory genes mistakenly assigned due to noise. In summary, 393 differentially distributed gene targets comprise the GAS accessory gene pool that fulfills the stated criteria for testing as a tissue tropism candidate.
Organization of nonprophage AGRs.
Of the 393 differentially distributed targets identified by CGH, 164 (41.7%) match sequences within the Alab49 reference genome (for the data set, see Table S2 in the supplemental material). depicts the distribution of Alab49-derived genes among the 97 GAS strains under study; each concentric circle represents the CGH findings for an individual strain. Numerous regions of genetic diversity are evident, several of which correspond to prophage. The four prophage and prophage-like elements of the Alab49 genome account for 119 (72.6%) of the Alab49-based targets (see Table S3). Most remaining accessory gene targets matching Alab49 (n = 41) correspond to one of 11 nonprophage AGRs (; ). Of the 229 differentially distributed targets that do not hybridize to Alab49, 132 match prophage genes in ≥1 of the sequenced GAS genomes or standalone transposases (for the data set, see Table S2). The remaining 97 targets correspond to 11 AGRs that are absent from Alab49 (). Overall, 22 AGRs were identified by CGH data analysis (). A linear map shows the relative position of the 22 AGRs of GAS ().
Fig. 1. Distribution of Alab49-derived genes among 96 GAS strains. The circular map summarizes CGH findings for Alab49-derived targets. The tick marks correspond to hybridization signal ratios of <2.8 (i.e., absent from test strain); each concentric circle (more ...)
Fig. 2. Linear genome map of accessory gene regions of GAS. The core GAS genome constitutes ~1.57 Mb, and the relative position of AGRs is indicated. AGRs present in strain Alab49 are in yellow (above bar); also shown are the positions of four prophage (more ...)
Figure S3 in the supplemental material depicts the gene content and arrangement of each of the 22 AGRs. Three AGRs have two distinct structural forms that are mutually exclusive (AGR-16A/16B, -17A/17B, and -18A/18B) (). At least five other AGRs having alternative forms (AGR-1/1X, -2/2X, -20/20X, -21/21X, and -22/22X) contain indels that arose via recombination of full-length genes. The emm pattern genotype is contained within AGR-21/21X (; also see Fig. S3-U). Only 6 of the 17 genes unique to Alab49 are in the subset of 393 differentially distributed targets; the remaining 11 genes unique to Alab49 are rare among the set of 97 GAS strains and unlikely to confer the skin tropism phenotype via a mechanism that is widely shared.
Negative genetic associations: throat versus skin specialist genotypes.
CGH findings for each of the 393 microarray targets corresponding to accessory genes underwent association analysis using a conservative statistical test for independence. Linkage disequilibrium (LD) measurements comparing strains having the emm pattern A-C versus D genotype reveals significant (P < 0.05) differences for 43 accessory gene targets (Fisher's exact test, two-tailed, BH method-corrected values); 21 of the 43 are highly significant (P < 0.01) (; also see the data set in Table S2 in the supplemental material).
LD between AGRs and 97 GAS strains according to emm pattern genotype
Of the 43 accessory gene targets displaying significant negative associations between emm
pattern genotype A-C versus D strains, 22 lie within nonprophage AGRs. Five are in the emm
region (AGR-21/21X), encoding the transcriptional regulator Mga, the fibronectin (Fn)-binding protein FbaA, the collagen-like protein SclA, and two M or M-like proteins (see Fig. S3-U in the supplemental material). Seven genes that distinguish throat and skin specialist strains lie within AGR-2/2X (FCT region) (see Fig. S3-B in the supplemental material), which maps ~0.3 Mb from the emm
region and contains genes for pilus biosynthesis (11
). Ten negatively linked accessory genes reside within AGR-17A/17B; all 33 GAS strains of the emm
pattern D genotype harbor AGR-17A (; also see the data set in Table S2 in the supplemental material), containing a putative magnesium transporter (CorA), NAD-dependent oxidoreductase (MocA), and a hypothetical protein (see Fig. S3-Q). The mutually exclusive AGR-17B constitutes a Cas-CRISPR region that is restricted to GAS strains of the emm
pattern A-C or E genotypes. Of the 21 remaining accessory gene targets displaying strong negative associations between emm
pattern genotype A-C versus D strains, all are associated with mobile genetic elements (i.e., prophage, transposases) or their remnants; based on gene annotation, none of the prophage genes displaying significant LD with the emm
pattern A-C versus D genotype corresponds to a probable virulence factor. Taken together, the data indicate that accessory genes conferring tissue site specificity for throat versus skin infection most likely reside within AGR-2/2X, AGR-17A/17B, and/or AGR-21/21X ().
Negative genetic associations involving generalists.
Most instances of statistically significant LD between accessory genes and emm pattern genotype involve the generalists (), which account for ~50% of isolates worldwide (see Fig. S2 in the supplemental material). Comparison of GAS strains of the emm pattern E genotype to each group of specialist strains reveals that 116 of the accessory gene targets exhibit statistically significant levels of LD (for the data set, see Table S2); of these, 39 targets correspond to prophage, phage remnants, or transposases. Seventy-six of the 116 negatively linked accessory gene targets are associated with 12 AGRs (). Nine of the 12 AGRs have 1 or more loci that display a significant difference (P < 0.05; Fisher's exact test with BH correction) in its distribution among emm pattern groups A-C versus E, whereas all 12 of the AGRs have one or more genes that are significantly different for the pattern D versus E genotypes. The most highly significant levels of LD (P < 1.00E−04) for emm pattern genotype A-C versus E strains are restricted to genes within AGR-21/21X (emm region). In contrast, the highest levels of statistically significant LD for emm pattern genotype D versus E strains (P < 1.00E−04) lie within six AGRs (for the data set, see Table S2); for the AGRs that do not have an alternate form, the AGR is present in most generalist strains and absent from the vast majority of skin specialist strains.
Population genetic structure of GAS based on AGRs.
The finding that more than half of the AGRs display significant LD among the different emm genotype-defined groups of strains () indicates that the frequency of many AGRs is highly skewed insofar as their distribution across the three ecological groups. Thus, it was of interest to define subpopulations of GAS based on AGR content and to determine the extent by which they correlate with the emm pattern-defined groupings. Each strain is assigned an AGR-based haplotype, whereby a single representative locus from each of 21 AGRs was converted to a binary character (presence or absence) based on CGH findings (for the data set, see Table S2 in the supplemental material); the character states are combined to yield the haplotype (see Table S4). Excluded from the haplotype is AGR-21, which contains the emm pattern marker. Among the 97 GAS strains, there are 96 unique AGR-based haplotypes of 21 characters, indicative of high diversity.
The AGR-based haplotypes were used to determine the population genetic structure of GAS, inferred using a Bayesian nonhierarchical clustering method implemented in STRUCTURE (version 2.3) (53
). Because AGR-21 is not part of the haplotype, population assignments are independent of the biomarker for tissue site preference of infection. In the mixed ancestry (or admixture) model, each strain draws some fraction of its genome from each population. The number of populations is user defined; assuming two populations (K
= 2) for the GAS data set of haplotypes, the vast majority of strains having the emm
pattern D or E genotype are highly homogeneous and fall into discrete groupings (). The cluster 2 population accounts for most of the fractional content of each emm
pattern D strain (mean average, 0.869), whereas most of the fractional content (0.861) of pattern E strains is assigned to cluster 1 (; also see Table S5 in the supplemental material). Overall, only 15.9% of strains having the pattern E genotype either fall into the pattern D dominant cluster or are >30% admixed; similarly, only 12.1% of strains having the pattern D genotype fall into the pattern E-like cluster or are heavily admixed.
Fig. 3. Population structure of GAS based on AGR content. STRUCTURE (version 2.3.3) was used to assess GAS population structure. It assumes there are K populations, each defined by a set of character state (presence/absence) frequencies at 21 loci. The loci (listed (more ...)
In contrast to emm pattern D and E strains, strains having the emm pattern A-C genotype do not sort into a single dominant cluster (K = 2) (; also see Table S5 in the supplemental material). Slightly more than half (60%) are pattern D-like (e.g., emm types 3, 5, 6, and 18), and the other strains are either pattern E-like (e.g., emm types 1, 12, and 55) or resemble an admixture of the two populations. Although highly mixed, the throat specialist group has a somewhat closer resemblance to skin specialists than to generalists.
Varying the number of user-defined populations reveals different substructures. Increasing the number of populations to K = 3 yields two clusters (1 and 2) that together account for a very high fractional content of emm pattern D strains, whereas emm pattern E strains are largely assorted into the cluster 3 population (; also see Table S5 in the supplemental material). Once again, the fractional contents of emm pattern A-C strains are split among the clusters that dominate the emm pattern D and E groupings. Increasing the K value to 4 or 5 fails to resolve a discrete population for the emm pattern A-C strains, and instead there is increasing overlap with the emm pattern D and E profiles. Of note, increasing K from 2 to 5 (or higher) has minimal impact on the log-likelihood probability value. In summary, population genetic structure modeling provides evidence that strains with the emm pattern D and E genotypes comprise discrete AGR-defined populations, whereas the group of emm pattern A-C strains lack a cohesive population structure, and many individual strains overlap with the pattern D or E-associated subpopulations.
Evolutionary genetics of GAS based on AGRs.
The 21 AGR binary character-based haplotypes used for population genetic structure analysis () can provide input data for phylogenetic analysis. A phylogenetic network, rather than a tree, is a more appropriate model for evaluating evolutionary relationships among members of a highly recombined population. SplitsTree4 (33
) provides a visualization of the incompatibilities between phylogenetic signals and was used to compute a phylogenetic network for the 97 GAS strains. The neighbor net method shows segregation of the vast majority of emm
pattern D and E strains at opposite ends of the network (). A minimum spanning network reveals perhaps an even sharper distinction between skin specialist and generalist strains (). These findings imply that there are many shared evolutionary steps for most strains assigned to the emm
pattern D grouping or to the emm
pattern E grouping. Outlier strains—pattern D strains that are pattern E-like and vice versa—are the same strains that were outliers by STRUCTURE (). In sharp contrast to the strong clustering observed for each of the emm
pattern D and E groups, emm
pattern A-C strains are widely scattered throughout the AGR-based networks ( and B). This finding is consistent with high levels of heterogeneity in AGR content among the throat specialist group and many nonoverlapping genetic pathways for the emergence of individual throat specialist strains.
Fig. 4. Networked evolution of GAS. SplitsTrees graphs are based on 97 GAS strains for panels A and B, with 21 binary characters denoting the presence or absence of genes representative of AGRs (see Table S4 in the supplemental material), or, for panel C, concatenates (more ...)
The AGRs display an evolutionary dynamic that is distinct from that of the core housekeeping genes. Concatenated nucleotide sequences of seven core housekeeping genes from each of the 97 GAS strains reveal a relative lack of clustering based on emm
pattern genotype (), a finding that is consistent with other studies showing ample flow of housekeeping genes between the different emm
). Thus, against a background of extensive core housekeeping gene exchange, the strong associations among AGRs are most likely to arise as the consequence of strong coselection pressure, such as that encountered during adaptation to an ecological niche.
Model for GAS evolution based on strongly associated AGRs.
A key objective is to delineate the sequence of evolutionary events that underlie the preferred tissue site of infection phenotypes of GAS. The phylogenetic networks based on the presence/absence of AGRs ( and B) represent an evolutionary history involving a very large number of plausible ancestral states, making it very difficult to correlate the gain or loss of particular AGRs with the emergence of the ecologically distinct groups of GAS strains. To help untangle the evolutionary pathways of highly recombined bacteria, we present a new model for GAS evolution based only on the AGR pairs that have highly significant genetic associations. In this new model, recombination events that are far less likely to contribute to the critical adaptive steps are effectively masked.
As a first step, a comprehensive analysis of gene-to-gene interactions among GAS was undertaken, and the degree of association for all possible combinations of the 393 differentially distributed gene targets was calculated. Out of the >77,000 unique pairwise comparisons, 4,685 display statistically significant nonrandom associations (P < 0.05; Fisher's exact test, BH corrected); of the pairs of AGR-associated genes, 1,991 (43%) involve two different AGRs.
Next, the identity of genes exhibiting the strongest nonrandom associations was established. provides a summary of the smallest P value (BH corrected) for any pair of genes belonging to different AGRs. Of the 428 unique two-way comparisons among the 22 AGRs, including alternative forms, 163 AGR combinations have P values of <0.05 for at least one pair of genes. Genes belonging to five AGRs (AGR-2, -13, -14, -17, and -21) display extraordinarily strong associations with genes of a second AGR, yielding P values of <1.00E−07 (, yellow boxes); an additional three AGRs (AGR-3, -4, and -16) display corrected P values of <1.00E−05 (green boxes). The eight AGRs having a very strong association (P < 1.00E−05) with another AGR give rise to 15 unique pairwise combinations of different AGRs; for these 15 pairs, negative versus positive association was established (). For example, for the positively associated gene pair of AGR-17B and AGR-13, 79% of the 97 GAS strains either lack or harbor genes from both AGRs.
Fig. 5. Positive and negative linkage between genes of different AGRs. The lowest BH-corrected P value (Fisher's exact test, two-tailed) for any gene pair assigned to each possible combination of AGRs is as follows: P < 5.00E−02 (purple), P < (more ...)
The possible role of coinheritance was assessed by evaluating the relative map positions of the eight AGRs having strong associations. Of the 15 unique pairs, only AGR-13 and AGR-14 lie adjacent on the GAS genome and lack an intervening AGR or prophage with a random distribution (); the shortest inter-AGR distance for any of the 15 strongly associated AGR pairs is 65.1 kb, between AGR-13 and AGR-14 (). Thus, if positively associated AGRs conferring an adaptive advantage were inherited via a single genetic event, subsequent recombination appears to have freely occurred in nearby nonadaptive regions without a loss of fitness, for at least 14 of the 15 strongly associated AGR pairs. Therefore, one can conclude that nearly all pairs of strongly associated AGRs are subject to strong coselection pressures.
To mask genetic changes unlikely to contribute to niche adaptations, modified haplotypes based on highly linked genes only are used. For each of the eight AGRs displaying a very high degree of association (P < 1.00E−05) (), the locus within each AGR having the lowest P value when paired with any gene of another AGR was chosen for input data (see Table S6 in the supplemental material). Among the 97 GAS strains, there are 34 haplotypes defined by the presence or absence of genes of the eight strongly associated AGRs (see Table S7); 14 haplotypes are represented by multiple strains.
Haplotypes based on the strongly associated AGRs provide input data for the new evolutionary model. reveals that 26 of the 34 haplotypes form a large cluster whereby every haplotype differs from ≥1 other haplotype by one genetic step. Overall, there are 31 possible connections (i.e., single locus variants) between haplotypes belonging to the major cluster. Within the major cluster, nearly all skin specialist strains (green) form a tight genetic network that is discrete from a second network dominated by generalist strains (red); a single haplotype (H24) provides the only connection between the two major subclusters (I, upper subcluster and II, lower subcluster). The data showing a division between skin specialists and generalists strongly parallel the findings obtained via STRUCTURE and SplitTrees analysis ( and ), which are based on all AGRs except AGR-21, which harbors the emm pattern genotype.
Fig. 6. Model of evolution for GAS based on highly linked AGRs. The locus within each AGR having the lowest P value, when compared to any gene of another AGR, is used to define character states (0, absence; 1, presence) for AGRs 2X, 3, 4, 13, 14, 16B, 17A, and (more ...)
Throat specialist haplotypes (blue) are preferentially associated with the skin specialist subcluster I of the evolutionary network (). However, a pattern A-C emm
) that accounts for a substantial fraction of pharyngitis cases in recent decades (59
) belongs to a haplotype pair (H8/H12) that differs from those of the main network by ≥2 of the 8 AGRs (see Table S7 in the supplemental material). Another highly prevalent pattern A-C emm
) is part of subcluster II (H11) (). Overall, emm
pattern A-C genotypes are highly diverse in AGR content, which is suggestive of multiple genetic pathways associated with adaptation to their narrow niche.
Gain/loss of Fn-binding protein genes in GAS evolution.
A striking feature of the evolutionary network () is the transition between skin specialists and numerous throat specialist strains within subcluster I, involving haplotypes H17 and H19. Haplotypes H17 and H19 differ in AGR-2/2X (FCT-region) content, whereby the haplotype-defining gene (prtF1) encodes a fibronectin (Fn)-binding protein (see Table S6 in the supplemental material). The data are consistent with the notion that gain or loss of prtF1 is a key genetic step that facilitates the exploitation of a new niche by many tissue specialist strains.
Several differentially distributed genes encoding Fn-binding proteins are present in GAS, and all lie within AGR-2/2X (prtF1, prtF2) or AGR-21/21X (fbaA, sfbX, sof). The haplotypes of the strains in the skin specialist-dominant subcluster I were further refined by incorporating the character state (presence or absence) of the five Fn-binding protein genes into the haplotype definition (; also see Table S7 in the supplemental material). Although the direction of the ancestor-descendant relationships is difficult to ascertain, transitions between throat and skin specialists are marked by changes in Fn-binding gene content, whereby throat specialists tend to have prtF1, and skin specialists tend to have fbaA. Thus, changes in the content of Fn-binding protein genes, or in closely mapping loci, strongly correlate with fundamental niche adaptations by GAS.
Fig. 7. Model of evolution showing gain and loss of fibronectin-binding protein genes. Detail of upper subcluster I presented in , showing an extended haplotype that incorporates the character state of Fn-binding protein loci within AGR-2/2X and AGR-21/21X (more ...)