|Home | About | Journals | Submit | Contact Us | Français|
Group A Streptococcus (GAS) has a rich evolutionary history of horizontal transfer among its core genes. Yet, despite extensive genetic mixing, GAS strains have discrete ecological phenotypes. To further our understanding of the molecular basis for ecological phenotypes, comparative genomic hybridization of a set of 97 diverse strains to a GAS pangenome microarray was undertaken, and the association of accessory genes with emm genotypes that define tissue tropisms for infection was determined. Of the 22 nonprophage accessory gene regions (AGRs) identified, only 3 account for all statistically significant linkage disequilibrium among strains having the genotypic biomarkers for throat versus skin infection specialists. Networked evolution and population structure analyses of loci representing each of the AGRs reveal that most strains with the skin specialist and generalist biomarkers form discrete clusters, whereas strains with the throat specialist biomarker are highly diverse. To identify coinherited and coselected accessory genes, the strength of genetic associations was determined for all possible pairwise combinations of accessory genes among the 97 GAS strains. Accessory genes showing very strong associations provide the basis for an evolutionary model, which reveals that a major transition between many throat and skin specialist haplotypes correlates with the gain or loss of genes encoding fibronectin-binding proteins. This study employs a novel synthesis of tools to help delineate the major genetic changes associated with key adaptive shifts in an extensively recombined bacterial species.
Understanding the evolution of niche specialization in a bacterial pathogen can lead to insights on the critical factors that are essential for disease. Group A Streptococcus (GAS; Streptococcus pyogenes) is an important bacterial pathogen, infecting humans only and having no other known reservoirs of consequence. GAS disease is highly prevalent throughout the world and accounts for ~750 million or more infections per year (17). The primary ecological niches for GAS are the epithelial surfaces of the upper respiratory tract and skin of the human host, and the vast majority of GAS infections involve one of these two tissues, resulting in pharyngitis and impetigo, respectively.
A purulent exudate (i.e., pus) is a prominent feature of superficial epithelial infections caused by GAS, arising from infiltrating leukocytes that transform the local environment into a nutrient-rich one which promotes the reproductive growth of GAS (12, 56). This, in turn, can facilitate the successful transmission of bacterial progeny to new hosts. Transmission is largely mediated through respiratory droplets or secretions and direct contact (14). GAS can also persist at the oropharynx in a dormant carrier state for many weeks or months, wherein the organism fails to cause disease symptoms (37). Carriage rates can exceed 20% in school-age populations, and throat carriage appears to constitute an important ecological niche, although carriers are far less likely to transmit than acutely ill persons (14). GAS can also colonize the skin several days prior to the development of skin lesions (29). Secondary transmission of GAS from an impetigo lesion to the upper respiratory tract is a well-established phenomenon and typically leads to a carrier state; however, transmission to the superficial skin from the throat is relatively rare (2, 14). In the absence of antibiotic treatment, acute infections of the epithelium usually resolve within 2 weeks by a host immune response directed to strain-specific antigens (14, 42). On very rare occasions, GAS gains access to deeper soft tissue, whereupon it can cause a severe illness.
M protein fibrils, which serve as essential virulence factors by preventing phagocytosis of the organism in the absence of strain-specific antibody, often provide the basis for distinguishing strains of GAS. The M proteins display extensive antigenic heterogeneity at the fibril tips, leading to the identification of >200 distinct emm types defined by the 5′ end gene sequence (4). emm type can be further classified according to emm pattern genotype, which is based on the 3′ end emm gene sequence combined with the chromosomal arrangement of emm and emm-like genes; there are three main emm pattern genotype groupings (9).
A notable feature of GAS is its extensive genetic recombination among different strains, as evidenced by multilocus sequence typing (MLST) data on core housekeeping genes (9, 20, 27, 31, 36, 48, 65). The emm gene also undergoes horizontal transfer, and therefore the utility of emm type as a strict definition for clone is somewhat limited (13, 24). Despite the high level of genetic diversity among GAS, many strains share key ecological properties. The rich epidemiological history of GAS disease reveals that superficial infection of the throat (pharyngitis) versus skin (impetigo) is largely caused by strains having discrete sets of M or emm types (2, 14, 16, 18, 21, 34, 47, 52, 67). This study sought to gain insight on the genetic basis for the two ecological phenotypes associated with primary GAS infections, using a population genomics approach.
The whole-genome sequence of strain Alab49, recovered from an impetigo lesion (Alabama; 1986), was determined using Sanger sequencing on ABI 3730xl machines. Two plasmid libraries were constructed, one with a target insert size of 2 to 3 kb (~7,500 reads) and one with a range of 6 to 8 kb (~11,000 reads); reads were assembled using Celera Assembler (50). All gaps between resulting contigs were closed manually via plasmid primer walking (sequencing gaps) or PCR sequencing (physical gaps). The resulting complete genome sequence was checked manually for assembly, sequence/clone coverage, and consensus base-calling accuracy. Open reading frames (ORFs) of the Alab49 genome were predicted and annotated using a suite of automated tools that combines the following: (i) Glimmer gene prediction (19, 55); (ii) ORF and non-ORF feature identification (e.g., protein motifs) via tRNAscan-SE (44), RNAmmer (41), hmmpfam (23, 39), blastp (1), SignalP (5), prosite (25), LipoP (35), and tmhmm (40); and (iii) assignment of database matches and functional role categories to genes (63), followed by partial manual annotation using Manatee (http://manatee.sourceforge.net/).
The pangenome microarray was constructed using the Alab49 genome as the reference. First, PCR primer pairs were designed for the predicted genes of Alab49, for the generation of amplicons. Next, additional genes present in strain SF370 but not already represented by an Alab49 amplicon were targeted for primer design. This cumulative design of primer pairs was performed using the following 12 strains, whose complete genome sequence is known, done sequentially in this order: Alab49, SF370, MGAS8232, MGAS315, MGAS10394, MGAS6180, MGAS5005, MGAS10270, MGAS10750, MGAS2096, Manfredo, and NZ131 (3, 7, 8, 28, 30, 32, 49, 51, 58, 61). All primers were designed using Primer3 (http://primer3.sourceforge.net) and a suite of in-house scripts; they were tested for uniqueness within the genome whose DNA was used for PCR amplification. The resulting 3,388 amplicons (microarray targets) ranged in size from 51 to 1,170 bp and included targets specific for variable regions of some orthologous genes across the 12 genomes; the mean average size of the targets is 213 nucleotides (nt) (median, 150 nt). Amplicon-based targets were spotted in triplicate on Corning UltraGap aminosilane slides (Corning) by using either a Lucidea (GE Healthcare) or Aushon Biosystems microarray-spotting robot. The quality and integrity of the microarrays were verified using SpotQC (Integrated DNA Technologies).
Cy3 and Cy5 (GE Life Sciences)-labeled cDNA probes were synthesized from genomic DNA (gDNA) isolated using the Easy-DNA kit (Invitrogen), as previously described (64). To eliminate a potential bias of unequal dye incorporation into cDNA, flip-dye experiments were performed for each hybridization, using Alab49 gDNA as the reference probe. Slides were scanned using the Axon GenePix4000B scanner (Molecular Devices), images were analyzed using TIGR Spotfinder (54), and data were normalized using iterative log-mean centering, as previously described (22). The mean normalized log2 (ratio of reference/test) of each locus was calculated from all spots with fluorescence data (two slides each with three spotted replicates, for a total of six spotted amplicons per locus). An amplicon was considered present in the reference strain Alab49 if it displayed >80% nt sequence identity with the Alab49 genome over >80% of its length. To help establish the most accurate definition for gene presence versus absence among the 96 test strains, the signal data ratios generated for each hybridization were plotted against the best percent identity of each amplicon when compared, by WU-BLASTN, to each of the 12 GAS genomes (see Fig. S1 in the supplemental material). Based on the plot, cutoff signal ratios used were as follows: a log2 mean ratio of >1.5 (equivalent to a ratio of 2.8) for Alab49 amplicons defining presence in Alab49 and absence from the test strain and a log2 mean ratio of less than −1.0 (equivalent to a ratio of 0.5) for non-Alab49 amplicons defining absence from Alab49 and presence in the test strain.
Each accessory gene region (AGR) was identified based on the corresponding map position of microarray targets showing a differential distribution among strains. The ORF boundary of each AGR was established by BLAST analysis using all sequenced GAS genomes as the reference. The genome map positions of Alab49 prophage were established based on predicted att sites and homology to previously identified prophages. Microarray targets were classified as phage or nonphage based on BLASTn analysis of the target query sequences to a database containing all known bacteriophage genes of GAS.
Population structure was inferred using STRUCTURE version 2.3 (53). Construction of a phylogenetic tree used the maximum parsimony method implemented in PAUP version 4.0b10. Phylogenetic networks were constructed via SplitsTrees version 4.10 (33).
Fisher's exact test for independence (two-tailed) was used to test the null hypothesis that there is no difference in the numbers of strains from two emm pattern groups testing positive for a given gene. All pairwise comparisons involving the 3 emm pattern groups were performed for 393 gene targets (1,179 tests), whereby the vast majority of gene targets on the microarray correspond to distinct loci. The Benjamini and Hochberg (BH) method was used to adjust the P value and decrease the probability of type I errors; the BH method controls the false-discovery rate and can lead to a substantial gain in power when compared to the Bonferroni adjustment (6); statistical analysis was conducted using the R software package. Fisher's exact test for independence with the BH correction was also used for testing the null hypothesis that the association of all possible pairwise combinations of the 393 differentially distributed targets is random.
The sequence of strain Alab49 is available through GenBank accession number CP003068. The Gene Expression Omnibus (GEO; http://www.ncbi.nlm.nih.gov/geo/) series accession number for the microarray data of this study is GSE31860.
Based on knowledge of emm type, the emm pattern genotype can be inferred with reasonable accuracy because organisms sharing an emm type are very rarely associated with different emm pattern genotypes (48). Figure S2 in the supplemental material provides a comprehensive summary of recent epidemiological surveys (1980 to 2009) for population-based collections of GAS causing superficial throat or skin infections (10, 57, 59, 60); emm pattern genotype is inferred based on emm type, and the percentage of total isolates corresponding to each emm pattern genotype is shown. This meta-analysis shows that as a group, GAS strains with the emm pattern A-to-C (herein designated “A-C”) genotype are strongly associated with throat infections (t < 0.0001; paired t test, two-tailed) (see Fig. S2) and therefore designated “throat specialists.” Overall, pattern A-C strains account for 46.2% of pharyngitis isolates but only 8.4% of impetigo isolates. In sharp contrast, the emm pattern D strains account for 50.4% of impetigo isolates, but only 1.8% of pharyngitis isolates (t < 0.0001); they are designated “skin specialists.” Strains of emm pattern E genotype account for almost equal fractions of superficial throat and skin infections (45.4 and 47.8%, respectively); as a group, they are designated “generalists.”
The global epidemiology meta-analysis supports the conclusion that GAS strains harboring the emm pattern D genotype are far better adapted to occupying the superficial skin infection niche than are pattern A-C strains, confirming earlier findings based on fewer strain collections (12). Similarly, strains of the emm pattern A-C genotype are better equipped to fill the superficial throat infection niche than are pattern D strains. The emm pattern E strain group is equally well adapted to both niches of infected host tissue. Thus, the emm pattern genotype is a valuable biomarker for tissue tropisms resulting in infection and is used throughout this report.
Finished (gap-free) genome sequences are available for 13 GAS strains representing 10 emm types, all of which are assigned to the emm pattern A-C or E genotype (3, 7, 8, 28, 30, 32, 49, 51, 58, 61). As a first step toward gaining further insight into the genetic basis for tissue tropisms, a finished genome sequence was determined for the emm pattern D genotype strain, Alab49 (emm type 53). The genome of Alab49 is 1.827 Mb, which is close in size to the other finished GAS genomes; there are 1,804 predicted protein-coding genes. Despite being the sole emm pattern D strain among GAS with sequenced genomes, Alab49 has only 17 unique genes based on a combined 80% identity and 80% coverage threshold value applied to BLASTn results (data not shown).
This study specifically seeks to identify GAS noncore accessory genes that may play a critical role in establishing tissue site preferences for infection. To achieve this goal, the Alab49 reference strain plus 96 genetically diverse test strains were subject to comparative genome hybridization (CGH) to a GAS pangenome microarray. Represented among the set of 97 GAS strains is a wide spectrum of emm types and geographic places of isolation (see Table S1 in the supplemental material). Twelve isolates whose complete genome sequence is established are included; they also serve as controls for CGH. The set of 97 GAS strains is represented by 94 emm types and 95 sequence types (STs), defined by MLST of seven housekeeping loci (see Table S1); GAS strains were recovered from 18 countries and all six major continents over a time frame spanning >60 years. Both highly prevalent and rare emm types are included; each strain is given equal weight in the long-term evolution analysis that follows, because strains of emm types having low prevalence in recent decades may be the dwindling progeny of major clones of previous eras or alternatively, newly emerged or reemerged clones. Of the 97 GAS strains in the set, 20 (20.6%) have the emm pattern A-C genotype, 33 (34%) have the emm pattern D genotype, and 44 (45.4%) have the emm pattern E genotype (see Table S1).
The gene content of the 97 diverse GAS strains was determined. The amplicon-based microarray used for CGH gives complete coverage of the GAS pangenome, including targets corresponding to the newly sequenced genome of Alab49, plus multiple alleles of highly divergent loci. Of the 3,388 targets that were spotted onto the microarray, 2,440 (72%) are based on Alab49 sequences. Overall, 3,366 (99%) targets yielded CGH data that are reliable. A finding of weak or no hybridization of genomic DNA to the microarray signifies either gene absence or extensive sequence divergence. For Alab49-based targets, a mean hybridization signal ratio of >2.8 (i.e., present in the Alab49 strain hybridized as a reference and absent from the test strain) was used to define gene presence; for non-Alab49-based targets, a mean hybridization signal ratio of <0.5 (i.e., absent in Alab49, present in the test) was used to define gene absence (see Fig. S1 in the supplemental material).
An accessory gene that confers the tissue specificity phenotype via a mechanism used by many strains should be neither too rare nor too common in its distribution among the set of 97 GAS strains; thus, very-low- and very-high-frequency genes are excluded from further analysis. Spots on the microarray having a presence ranging from 15 to 85% among the set of 97 strains led to identification of 393 differentially distributed targets (for the data set, see Table S2 in the supplemental material). Thus, genes corresponding to the 393 targets are a subset of accessory genes and exclude genes that are either rare (present in <15% of strains) or nearly as common as core genes (present in >85% of strains).
CGH findings were validated using PCR amplification of 16 genes with a differential distribution among GAS strains. The CGH findings on the presence/absence of a gene are ≥94% concordant with PCR-based findings for 13 of 16 genes and ≥85% concordant for 15 of 16 genes (data not shown). Thus, the CGH and PCR-based findings are highly concordant, even though the oligonucleotide primers used in PCR amplification do not always overlap with the amplicon spotted onto the microarray, and several genes have high sequence heterogeneity. Although the CGH findings are highly reliable for most genes, the signal ratio threshold for assigning presence or absence may be too stringent for some genes. Nonetheless, the stringent cutoff values employed, as established in Fig. S1 in the supplemental material, reduce the number of accessory genes mistakenly assigned due to noise. In summary, 393 differentially distributed gene targets comprise the GAS accessory gene pool that fulfills the stated criteria for testing as a tissue tropism candidate.
Of the 393 differentially distributed targets identified by CGH, 164 (41.7%) match sequences within the Alab49 reference genome (for the data set, see Table S2 in the supplemental material). Figure 1 depicts the distribution of Alab49-derived genes among the 97 GAS strains under study; each concentric circle represents the CGH findings for an individual strain. Numerous regions of genetic diversity are evident, several of which correspond to prophage. The four prophage and prophage-like elements of the Alab49 genome account for 119 (72.6%) of the Alab49-based targets (see Table S3). Most remaining accessory gene targets matching Alab49 (n = 41) correspond to one of 11 nonprophage AGRs (Fig. 1; Table 1). Of the 229 differentially distributed targets that do not hybridize to Alab49, 132 match prophage genes in ≥1 of the sequenced GAS genomes or standalone transposases (for the data set, see Table S2). The remaining 97 targets correspond to 11 AGRs that are absent from Alab49 (Table 1). Overall, 22 AGRs were identified by CGH data analysis (Table 1). A linear map shows the relative position of the 22 AGRs of GAS (Fig. 2).
Figure S3 in the supplemental material depicts the gene content and arrangement of each of the 22 AGRs. Three AGRs have two distinct structural forms that are mutually exclusive (AGR-16A/16B, -17A/17B, and -18A/18B) (Table 1). At least five other AGRs having alternative forms (AGR-1/1X, -2/2X, -20/20X, -21/21X, and -22/22X) contain indels that arose via recombination of full-length genes. The emm pattern genotype is contained within AGR-21/21X (Table 1; also see Fig. S3-U). Only 6 of the 17 genes unique to Alab49 are in the subset of 393 differentially distributed targets; the remaining 11 genes unique to Alab49 are rare among the set of 97 GAS strains and unlikely to confer the skin tropism phenotype via a mechanism that is widely shared.
CGH findings for each of the 393 microarray targets corresponding to accessory genes underwent association analysis using a conservative statistical test for independence. Linkage disequilibrium (LD) measurements comparing strains having the emm pattern A-C versus D genotype reveals significant (P < 0.05) differences for 43 accessory gene targets (Fisher's exact test, two-tailed, BH method-corrected values); 21 of the 43 are highly significant (P < 0.01) (Table 2; also see the data set in Table S2 in the supplemental material).
Of the 43 accessory gene targets displaying significant negative associations between emm pattern genotype A-C versus D strains, 22 lie within nonprophage AGRs. Five are in the emm region (AGR-21/21X), encoding the transcriptional regulator Mga, the fibronectin (Fn)-binding protein FbaA, the collagen-like protein SclA, and two M or M-like proteins (see Fig. S3-U in the supplemental material). Seven genes that distinguish throat and skin specialist strains lie within AGR-2/2X (FCT region) (see Fig. S3-B in the supplemental material), which maps ~0.3 Mb from the emm region and contains genes for pilus biosynthesis (11, 26, 38). Ten negatively linked accessory genes reside within AGR-17A/17B; all 33 GAS strains of the emm pattern D genotype harbor AGR-17A (Table 2; also see the data set in Table S2 in the supplemental material), containing a putative magnesium transporter (CorA), NAD-dependent oxidoreductase (MocA), and a hypothetical protein (see Fig. S3-Q). The mutually exclusive AGR-17B constitutes a Cas-CRISPR region that is restricted to GAS strains of the emm pattern A-C or E genotypes. Of the 21 remaining accessory gene targets displaying strong negative associations between emm pattern genotype A-C versus D strains, all are associated with mobile genetic elements (i.e., prophage, transposases) or their remnants; based on gene annotation, none of the prophage genes displaying significant LD with the emm pattern A-C versus D genotype corresponds to a probable virulence factor. Taken together, the data indicate that accessory genes conferring tissue site specificity for throat versus skin infection most likely reside within AGR-2/2X, AGR-17A/17B, and/or AGR-21/21X (Table 2).
Most instances of statistically significant LD between accessory genes and emm pattern genotype involve the generalists (Table 2), which account for ~50% of isolates worldwide (see Fig. S2 in the supplemental material). Comparison of GAS strains of the emm pattern E genotype to each group of specialist strains reveals that 116 of the accessory gene targets exhibit statistically significant levels of LD (for the data set, see Table S2); of these, 39 targets correspond to prophage, phage remnants, or transposases. Seventy-six of the 116 negatively linked accessory gene targets are associated with 12 AGRs (Table 2). Nine of the 12 AGRs have 1 or more loci that display a significant difference (P < 0.05; Fisher's exact test with BH correction) in its distribution among emm pattern groups A-C versus E, whereas all 12 of the AGRs have one or more genes that are significantly different for the pattern D versus E genotypes. The most highly significant levels of LD (P < 1.00E−04) for emm pattern genotype A-C versus E strains are restricted to genes within AGR-21/21X (emm region). In contrast, the highest levels of statistically significant LD for emm pattern genotype D versus E strains (P < 1.00E−04) lie within six AGRs (for the data set, see Table S2); for the AGRs that do not have an alternate form, the AGR is present in most generalist strains and absent from the vast majority of skin specialist strains.
The finding that more than half of the AGRs display significant LD among the different emm genotype-defined groups of strains (Table 2) indicates that the frequency of many AGRs is highly skewed insofar as their distribution across the three ecological groups. Thus, it was of interest to define subpopulations of GAS based on AGR content and to determine the extent by which they correlate with the emm pattern-defined groupings. Each strain is assigned an AGR-based haplotype, whereby a single representative locus from each of 21 AGRs was converted to a binary character (presence or absence) based on CGH findings (for the data set, see Table S2 in the supplemental material); the character states are combined to yield the haplotype (see Table S4). Excluded from the haplotype is AGR-21, which contains the emm pattern marker. Among the 97 GAS strains, there are 96 unique AGR-based haplotypes of 21 characters, indicative of high diversity.
The AGR-based haplotypes were used to determine the population genetic structure of GAS, inferred using a Bayesian nonhierarchical clustering method implemented in STRUCTURE (version 2.3) (53). Because AGR-21 is not part of the haplotype, population assignments are independent of the biomarker for tissue site preference of infection. In the mixed ancestry (or admixture) model, each strain draws some fraction of its genome from each population. The number of populations is user defined; assuming two populations (K = 2) for the GAS data set of haplotypes, the vast majority of strains having the emm pattern D or E genotype are highly homogeneous and fall into discrete groupings (Fig. 3A). The cluster 2 population accounts for most of the fractional content of each emm pattern D strain (mean average, 0.869), whereas most of the fractional content (0.861) of pattern E strains is assigned to cluster 1 (Fig. 3B; also see Table S5 in the supplemental material). Overall, only 15.9% of strains having the pattern E genotype either fall into the pattern D dominant cluster or are >30% admixed; similarly, only 12.1% of strains having the pattern D genotype fall into the pattern E-like cluster or are heavily admixed.
In contrast to emm pattern D and E strains, strains having the emm pattern A-C genotype do not sort into a single dominant cluster (K = 2) (Fig. 3; also see Table S5 in the supplemental material). Slightly more than half (60%) are pattern D-like (e.g., emm types 3, 5, 6, and 18), and the other strains are either pattern E-like (e.g., emm types 1, 12, and 55) or resemble an admixture of the two populations. Although highly mixed, the throat specialist group has a somewhat closer resemblance to skin specialists than to generalists.
Varying the number of user-defined populations reveals different substructures. Increasing the number of populations to K = 3 yields two clusters (1 and 2) that together account for a very high fractional content of emm pattern D strains, whereas emm pattern E strains are largely assorted into the cluster 3 population (Fig. 3; also see Table S5 in the supplemental material). Once again, the fractional contents of emm pattern A-C strains are split among the clusters that dominate the emm pattern D and E groupings. Increasing the K value to 4 or 5 fails to resolve a discrete population for the emm pattern A-C strains, and instead there is increasing overlap with the emm pattern D and E profiles. Of note, increasing K from 2 to 5 (or higher) has minimal impact on the log-likelihood probability value. In summary, population genetic structure modeling provides evidence that strains with the emm pattern D and E genotypes comprise discrete AGR-defined populations, whereas the group of emm pattern A-C strains lack a cohesive population structure, and many individual strains overlap with the pattern D or E-associated subpopulations.
The 21 AGR binary character-based haplotypes used for population genetic structure analysis (Fig. 3) can provide input data for phylogenetic analysis. A phylogenetic network, rather than a tree, is a more appropriate model for evaluating evolutionary relationships among members of a highly recombined population. SplitsTree4 (33) provides a visualization of the incompatibilities between phylogenetic signals and was used to compute a phylogenetic network for the 97 GAS strains. The neighbor net method shows segregation of the vast majority of emm pattern D and E strains at opposite ends of the network (Fig. 4A). A minimum spanning network reveals perhaps an even sharper distinction between skin specialist and generalist strains (Fig. 4B). These findings imply that there are many shared evolutionary steps for most strains assigned to the emm pattern D grouping or to the emm pattern E grouping. Outlier strains—pattern D strains that are pattern E-like and vice versa—are the same strains that were outliers by STRUCTURE (Fig. 3A). In sharp contrast to the strong clustering observed for each of the emm pattern D and E groups, emm pattern A-C strains are widely scattered throughout the AGR-based networks (Fig. 4A and B). This finding is consistent with high levels of heterogeneity in AGR content among the throat specialist group and many nonoverlapping genetic pathways for the emergence of individual throat specialist strains.
The AGRs display an evolutionary dynamic that is distinct from that of the core housekeeping genes. Concatenated nucleotide sequences of seven core housekeeping genes from each of the 97 GAS strains reveal a relative lack of clustering based on emm pattern genotype (Fig. 4C), a finding that is consistent with other studies showing ample flow of housekeeping genes between the different emm genotypes (9, 36). Thus, against a background of extensive core housekeeping gene exchange, the strong associations among AGRs are most likely to arise as the consequence of strong coselection pressure, such as that encountered during adaptation to an ecological niche.
A key objective is to delineate the sequence of evolutionary events that underlie the preferred tissue site of infection phenotypes of GAS. The phylogenetic networks based on the presence/absence of AGRs (Fig. 4A and B) represent an evolutionary history involving a very large number of plausible ancestral states, making it very difficult to correlate the gain or loss of particular AGRs with the emergence of the ecologically distinct groups of GAS strains. To help untangle the evolutionary pathways of highly recombined bacteria, we present a new model for GAS evolution based only on the AGR pairs that have highly significant genetic associations. In this new model, recombination events that are far less likely to contribute to the critical adaptive steps are effectively masked.
As a first step, a comprehensive analysis of gene-to-gene interactions among GAS was undertaken, and the degree of association for all possible combinations of the 393 differentially distributed gene targets was calculated. Out of the >77,000 unique pairwise comparisons, 4,685 display statistically significant nonrandom associations (P < 0.05; Fisher's exact test, BH corrected); of the pairs of AGR-associated genes, 1,991 (43%) involve two different AGRs.
Next, the identity of genes exhibiting the strongest nonrandom associations was established. Figure 5 provides a summary of the smallest P value (BH corrected) for any pair of genes belonging to different AGRs. Of the 428 unique two-way comparisons among the 22 AGRs, including alternative forms, 163 AGR combinations have P values of <0.05 for at least one pair of genes. Genes belonging to five AGRs (AGR-2, -13, -14, -17, and -21) display extraordinarily strong associations with genes of a second AGR, yielding P values of <1.00E−07 (Fig. 5, yellow boxes); an additional three AGRs (AGR-3, -4, and -16) display corrected P values of <1.00E−05 (green boxes). The eight AGRs having a very strong association (P < 1.00E−05) with another AGR give rise to 15 unique pairwise combinations of different AGRs; for these 15 pairs, negative versus positive association was established (Fig. 5). For example, for the positively associated gene pair of AGR-17B and AGR-13, 79% of the 97 GAS strains either lack or harbor genes from both AGRs.
The possible role of coinheritance was assessed by evaluating the relative map positions of the eight AGRs having strong associations. Of the 15 unique pairs, only AGR-13 and AGR-14 lie adjacent on the GAS genome and lack an intervening AGR or prophage with a random distribution (Fig. 5); the shortest inter-AGR distance for any of the 15 strongly associated AGR pairs is 65.1 kb, between AGR-13 and AGR-14 (Fig. 2). Thus, if positively associated AGRs conferring an adaptive advantage were inherited via a single genetic event, subsequent recombination appears to have freely occurred in nearby nonadaptive regions without a loss of fitness, for at least 14 of the 15 strongly associated AGR pairs. Therefore, one can conclude that nearly all pairs of strongly associated AGRs are subject to strong coselection pressures.
To mask genetic changes unlikely to contribute to niche adaptations, modified haplotypes based on highly linked genes only are used. For each of the eight AGRs displaying a very high degree of association (P < 1.00E−05) (Fig. 5), the locus within each AGR having the lowest P value when paired with any gene of another AGR was chosen for input data (see Table S6 in the supplemental material). Among the 97 GAS strains, there are 34 haplotypes defined by the presence or absence of genes of the eight strongly associated AGRs (see Table S7); 14 haplotypes are represented by multiple strains.
Haplotypes based on the strongly associated AGRs provide input data for the new evolutionary model. Figure 6 reveals that 26 of the 34 haplotypes form a large cluster whereby every haplotype differs from ≥1 other haplotype by one genetic step. Overall, there are 31 possible connections (i.e., single locus variants) between haplotypes belonging to the major cluster. Within the major cluster, nearly all skin specialist strains (green) form a tight genetic network that is discrete from a second network dominated by generalist strains (red); a single haplotype (H24) provides the only connection between the two major subclusters (I, upper subcluster and II, lower subcluster). The data showing a division between skin specialists and generalists strongly parallel the findings obtained via STRUCTURE and SplitTrees analysis (Fig. 3 and and4),4), which are based on all AGRs except AGR-21, which harbors the emm pattern genotype.
Throat specialist haplotypes (blue) are preferentially associated with the skin specialist subcluster I of the evolutionary network (Fig. 6). However, a pattern A-C emm type (emm1) that accounts for a substantial fraction of pharyngitis cases in recent decades (59) belongs to a haplotype pair (H8/H12) that differs from those of the main network by ≥2 of the 8 AGRs (see Table S7 in the supplemental material). Another highly prevalent pattern A-C emm type (emm12) is part of subcluster II (H11) (Fig. 6). Overall, emm pattern A-C genotypes are highly diverse in AGR content, which is suggestive of multiple genetic pathways associated with adaptation to their narrow niche.
A striking feature of the evolutionary network (Fig. 6) is the transition between skin specialists and numerous throat specialist strains within subcluster I, involving haplotypes H17 and H19. Haplotypes H17 and H19 differ in AGR-2/2X (FCT-region) content, whereby the haplotype-defining gene (prtF1) encodes a fibronectin (Fn)-binding protein (see Table S6 in the supplemental material). The data are consistent with the notion that gain or loss of prtF1 is a key genetic step that facilitates the exploitation of a new niche by many tissue specialist strains.
Several differentially distributed genes encoding Fn-binding proteins are present in GAS, and all lie within AGR-2/2X (prtF1, prtF2) or AGR-21/21X (fbaA, sfbX, sof). The haplotypes of the strains in the skin specialist-dominant subcluster I were further refined by incorporating the character state (presence or absence) of the five Fn-binding protein genes into the haplotype definition (Fig. 7; also see Table S7 in the supplemental material). Although the direction of the ancestor-descendant relationships is difficult to ascertain, transitions between throat and skin specialists are marked by changes in Fn-binding gene content, whereby throat specialists tend to have prtF1, and skin specialists tend to have fbaA. Thus, changes in the content of Fn-binding protein genes, or in closely mapping loci, strongly correlate with fundamental niche adaptations by GAS.
A hallmark feature of GAS is its extensive history of horizontal transfer of core genes, posing challenges for constructing a model for evolutionary descent of strains within this species (9, 27, 31, 36, 48, 65, 66). Yet, against a highly random genetic background, adaptations to key ecological niches appear to be highly stable. The genetic basis for the ecological phenotypes for preferred sites of infection most likely lie with a unique combination of genes, small indels, point mutations, and/or gene arrangements. In this report, three accessory gene regions (AGR-2/2X, -17A/17B, and -21/21X) are identified as displaying strong associations with throat versus skin strains.
Several genes that lie within AGR-2 and AGR-21 of Alab49 and display strong LD with emm pattern A-C versus D strains (for the data set, see Table S2 in the supplemental material) also play a critical role in virulence at the skin when GAS strains with targeted gene deletions are tested in a humanized mouse model for impetigo (12, 43, 45, 46, 62). Essential virulence factors include regulators of transcription (Mga, Nra) and extracellular proteins (M53/PAM, Ska, Cpa, PrtF2). However, an Alab49 mutant with a deletion in AGR-17A (ΔcorA) exhibits no detectable loss of virulence in the humanized mouse model for skin infection (data not shown). The current data indicate that AGR-2 and/or AGR-21 may account for all accessory genes that influence tissue site preferences for primary GAS infection. A direct role for FbaA (AGR-21) in skin infection remains to be established but can be accomplished by showing a loss of virulence following the deletion of fbaA in a skin specialist strain or a gain of virulence following fbaA addition to a throat specialist strain. The strong associations of global transcriptional regulators within AGR-2/2X and AGR-21/21X raise the possibility that skin and throat strain specialists differ in their transcriptomes.
AGR-21 and AGR-2 lie 273 kb apart in the Alab49 genome. In the closely related streptococcal species S. agalactiae, large segments of DNA (up to 334 kb) have undergone horizontal transfer and chromosomal replacement (15). Recombination involving a foreign DNA fragment of that size is undocumented for GAS, but if it occurs, it might lead to the acquisition of multiple AGRs via a single genetic event. However, the prophage and AGRs lying between AGR-21/21X and AGR-2/2X have a highly variable distribution across GAS strains. Thus, the numerous strong associations between individual genes within AGR-2/2X and AGR-21/21X, as revealed through the comprehensive analysis of gene-to-gene interactions (Fig. 5; data not shown), seem to arise through a selection pressure that drives adaptation.
Secondary transmission of GAS from an impetigo lesion to the upper respiratory tract is a well-established phenomenon and typically leads to a throat carrier state whereby the organism fails to cause symptoms of disease (2, 14). However, the throat carrier state may be an evolutionary dead end for skin specialist strains, because transmission to the superficial skin from the throat, or to a new host via a respiratory route, is relatively rare (2, 14). Thus, the evolutionary transition between disease-causing skin and throat specialists appears to require at least two important phenotypic shifts: gain/loss of highly effective transmission via a respiratory route and gain/loss of the ability to cause superficial skin infection. The emergence of some throat specialists (i.e., subcluster I) (Fig. 7) from skin specialists (or vice versa) may provide a clue as to the molecular basis for these phenotypic shifts. H17y includes emm types associated with highly prevalent throat strains (emm types 3, 5, 18) (59), and therefore fbaA and/or other AGR-21/21X-linked genes may be the most critical for conferring the adaptive shift within subcluster I.
The two major subclusters dominated by skin specialists and generalists are connected via a weak link that includes only the H24 strain (Fig. 6), which has an emm type (emmstNS90) that has been rarely recovered in recent surveys (59). If H24 represents the progenitor of both the skin specialist and generalist subclusters, it appears to be close to extinction. H24 differs from the connected haplotypes of subcluster I and II in AGR-14 and AGR-21/21X, as well as in its content of Fn-binding protein genes within AGR-2/2X (see Table S7 in the supplemental material). The distant genome map positions for the three AGRs (Fig. 2) signify that it is highly unlikely that the transition between subcluster I and II involved a single genetic event. As with the transition between throat and skin specialist haplotypes within subcluster I, change in Fn-binding protein gene content is associated with the shift between the major skin specialist and generalist subclusters.
The population genetic structure of GAS reveals a history of extensive recombination when core genes are considered (9, 20, 27, 31, 36, 48, 65). By definition, AGRs are the products of horizontal transfer and/or differential genome reduction. Yet, the AGR content of GAS strains is highly uneven and reveals discrete population substructures corresponding to the skin specialists and generalists. The relative homogeneity of these two ecological groups stands in sharp contrast to the throat specialists. These findings lead to the two major conclusions of this report: that strong associations between AGRs arise due to strong selection pressures that drive niche adaptations, and that there exist multiple molecular strategies for effective respiratory transmission and oropharyngeal infection.
We thank Andrew Steer for providing detailed emm type data on published epidemiological studies and Sean C. Daugherty for help with annotation and data management. We are grateful to the sequencing, assembly, and closure teams at The Institute for Genomic Research for genome data processing.
This work was supported by National Institutes of Health grants AI-065572 (to D.E.B. and H.T.) and AI-053826 (to D.E.B.).
#Supplemental material for this article may be found at http://jb.asm.org/.
Published ahead of print on 23 September 2011.