|Home | About | Journals | Submit | Contact Us | Français|
The widespread presence of antibiotic resistance and virulence among Staphylococcus isolates has been attributed in part to lateral genetic transfer (LGT), but little is known about the broader extent of LGT within this genus. Here we report the first systematic study of the modularity of genetic transfer among 13 Staphylococcus genomes covering four distinct named species. Using a topology-based phylogenetic approach, we found, among 1,354 sets of homologous genes examined, strong evidence of LGT in 368 (27.1%) gene sets, and weaker evidence in another 259 (19.1%). Within-gene and whole-gene transfer contribute almost equally to the topological discordance of these gene sets against a reference phylogeny. Comparing genetic transfer in single-copy and in multicopy gene sets, we observed a higher frequency of LGT in the latter, and a substantial functional bias in cases of whole-gene transfer (little such bias was observed in cases of fragmentary genetic transfer). We found evidence that lateral transfer, particularly of entire genes, impacts not only functions related to antibiotic, drug, and heavy-metal resistance, as well as membrane transport, but also core informational and metabolic functions not associated with mobile elements. Although patterns of sequence similarity support the cohesion of recognized species, LGT within S. aureus appears frequently to disrupt clonal complexes. Our results demonstrate that LGT and gene duplication play important parts in functional innovation in staphylococcal genomes.
Staphylococci are nonmotile but invasive Gram-positive bacteria that are associated with various pus-forming diseases in humans and other animals. The most prominent pathogenic species in the genus is Staphylococcus aureus, of which various strains colonize the nasal passages and skin in humans, causing illnesses that range from minor skin lesions or infections to life-threatening diseases, e.g., meningitis, septicemia (bacteremia), and toxic shock syndrome (55, 67, 94). The other species of Staphylococcus, although lacking genes that encode virulence factors and toxins, are opportunistic pathogens for immunocompromised patients (2, 74, 82).
One of the major problems in the prognosis of staphylococcal infections is the progressive development of resistance in the Staphylococcus species to multiple antibiotics (14), e.g., methicillin (36, 102) and vancomycin (23, 84), which has, as in other pathogenic bacteria, been attributed to the susceptibility of the organisms to genetic transfer (22, 83). Lateral genetic transfer (LGT) occurs when the organisms acquire exogenous genetic material that encodes antibiotic resistance (6, 105), and this material becomes established in the lineage, whether by recombination into the genome or, potentially less stably, on an extrachromosomal genetic element (31, 86). In Staphylococcus, transfer of genetic material has been shown to be mediated by phage transduction and conjugative transfer (19, 62, 79).
Although a number of studies have examined the frequency of LGT in prokaryotes using rigorous phylogenetic approaches (8, 63, 64, 85, 108), studies of LGT in Staphylococcus (11) have been limited. An early analysis based on sequence similarity searches (60) and more-recent genome comparative studies (44, 66) on S. aureus have revealed pathogenicity-related and/or extrachromosomal genes (i.e., genomic elements present externally to the chromosome) that are likely to have been acquired via LGT. These mobile genetic elements (MGEs) (65, 66, 69) include pathogenicity islands, plasmids, transposons, and the staphylococcal cassette genomes (SCCs). Many of these genes are associated with virulence, e.g., “superantigen” genes implicated in toxic shock and food poisoning (80), or with antibiotic resistance, e.g., the SCCmec elements that encode methicillin resistance in S. aureus (50, 77), suggesting that genetic transfer is a key contributing factor to the evolution of virulence and antibiotic resistance in these species. Based on allelic profiles derived from seven highly conserved housekeeping genes in various S. aureus isolates, recent multilocus sequence typing (MLST) studies (28, 31), while grouping diverse isolates into multiple clonal complexes (CCs; http://saureus.mlst.net/), have also shown that chromosomal genes of S. aureus are inherited largely vertically, i.e., parent to offspring within a lineage. However, little is known about the extent of gene sharing within and between these CCs of S. aureus.
Although recombination has been extensively studied in bacteria (30, 73), large-scale phylogenetic (phylogenomic) studies are typically based on the assumption that the units of genetic transfer are necessarily whole genes. It has been shown, however, that genetic material integrated via LGT can constitute an entire gene (41, 104), a partial (fragmentary) gene (10, 49, 75), or multiple (entire or fragmentary) adjacent genes, including multigene operons (34, 48, 81). We recently examined the transfer of genes and gene fragments among 144 prokaryote genomes (15, 18), demonstrating that studies focusing entirely on whole-gene transfer are likely to underestimate the extent of LGT. In the previous analyses, we included only single-copy gene sets to avoid complications of paralogy in the inference of LGT. Using a more-focused data set than in our previous studies, here we apply a statistically based phylogenetic approach to examine the modularity of LGT in 13 completely sequenced genomes of Staphylococcus (both chromosomes and extrachromosomal elements; Table 1 ), this time also including gene sets with two or more members in individual genomes.
The genomes of S. aureus subsp. aureus N315 and Mu50 were the first of any Staphylococcus spp. to be sequenced and released (60); both strains were isolated from male patients with postsurgical wound infections in Japan. The strains MW2 (4) and MSSA476 (45) are both isolates of community-acquired S. aureus infections; the strain COL (36) is the oldest isolate of methicillin-resistant S. aureus (MRSA) and dates back to 1976. The strain USA300 (24) was isolated from unassociated outbreaks of S. aureus infections in the United States, Canada, and Europe within the last decade. Finally, the strain NCTC8325 (47) has been used as the generic representative strain of S. aureus in most genetic studies. The S. aureus strain MRSA252 (45) is a hospital-acquired MRSA, while RF122 (42) is a common strain associated with mastitis diseases in cattle.
The other three Staphylococcus species—S. epidermidis, S. haemolyticus, and S. saprophyticus—are nonvirulent but have been associated with a number of opportunistic infections in immunocompromised patients. S. epidermidis strain RP62A (36) is a biofilm-forming strain that can cause toxic shock syndrome and scarlet fever, whereas strain ATCC 1228 (107) of the same species is a non-biofilm-forming, nonpathogenic strain that has been used for detecting residual antibiotics in food products. S. haemolyticus (strain JCSC1435) (100) and S. saprophyticus (strain ATCC 15305) (61) are generally opportunistic pathogens; S. haemolyticus infrequently causes soft tissue infections, and S. saprophyticus is predominantly implicated in genitourinary tract infections.
Multiple homologs within a genome can be interpreted as paralogs (a result of within-genome duplication) or xenologs (arising from acquisition of a gene copy from an external source via LGT) (33, 39, 58). Since we do not have prior knowledge on their origins, we follow Lerat et al. (64) in referring to these multiple homologs as “synologs.” We characterize the frequencies of within- and whole-gene transfer in gene sets that contain no synolog and in those that contain one or more synologs and discuss correlations with annotated gene functions. This represents the first systematic study of the transfer of genes and gene fragments in any prokaryotic genus, in which duplicated gene copies are also considered. Given that sequenced genomes can be assigned to established CCs in S. aureus, we also examined the extent to which LGT can disrupt established CCs.
Thirteen completely sequenced genomes of Staphylococcus spp. were downloaded from GenBank (http://www.ncbi.nlm.nih.gov/); their accession numbers and putative CCs are presented in Table 1. These genomes represent four distinct species: S. aureus (nine isolates), S. epidermidis (two isolates), S. haemolyticus (one isolate), and S. saprophyticus (one isolate).
The 34,066 protein sequences predicted from these genomes were clustered into 2,924 sets based on similarity matching via a Markov clustering algorithm (inflation parameter 1.1) (27). These sets consist of putatively homologous protein sequences; some are encoded by one or more copies of genes within a single genome, i.e., synologs. Multiple sequence alignment was performed on each protein set using T-COFFEE (78), with four combinations of the penalties for gap opening (25 and 50) and gap extension (0 and 10), and MUSCLE (26), with the default settings. The alignments were validated by using a pattern-centric objective function (7); the alignment receiving the highest score according to the objective function was selected as optimal for each protein set.
The protein alignments were converted into nucleotide sequence alignments, with nucleotide triplets arranged to parallel exactly the protein alignment in each case. For these nucleotide sequence alignments, we require gene set sizes of ≥4, because 4 is the minimum size that can yield distinct topologies if every sequence in the set is unique. Where identical nucleotide sequences were present in a set, we removed (at random) all but one copy, after which gene sets with sizes of <4 (1,493; 51.1%) were again excluded. Almost all instances of the identical copies removed from the data set (99.9% of 7,356 sequences) represented organisms annotated as the same species or strain; therefore, the effect of this step in altering final tree topologies among distinct genome isolates (in subsequent analysis [see below]) is negligible. In addition, we excluded gene sets with sizes of >52 (77; 2.6%) for computational reasons: Bayesian phylogenetic inference is computationally very demanding for large sequence sets (56 of which, i.e., 1.9%, have sizes of >70), and interpretation of the resulting tree is often problematic owing to poor convergence and the multiplicity of possible topology resolution paths. In this way, the data set used subsequently in the present study was reduced to 1,354 sets (4 ≤ N ≤ 52) corresponding to 13,297 genes (39.0% of all genes annotated in these genomes; the representation of the genomes in these gene sets is shown in Table 1). Each of the 13 genomes, if equally represented in a gene set of size 52, can have a maximum of four gene copies. The overall pairwise nucleotide sequence similarity across all gene sets ranges from 0.47 to 0.99, with a mean of 0.74 and standard deviation of 0.12.
As a reference phylogenetic tree for the 13 Staphylococcus isolates, we computed a supertree using the matrix representation with parsimony (MRP) approach (87). Hybrid clustering (40) of all proteins based on their pairwise distances (BLASTP; e-value ≤ 10−3) among 34,066 protein-coding sequences yielded 2,645 putatively orthologous protein sets (maximally representative clusters) (8). Sequences in each set were aligned as described above and phylogenetic trees were inferred using MrBayes (89) with an MCMC chain length of 2,500,000, of which the first 500,000 generations, comprising 5,001 sampled trees, were discarded; the K2P nucleotide substitution model (56); and a four-category approximation to the gamma distribution for among-site rate variation (106). The MRP matrix was generated from these trees using CLANN version 2.0.1 (21), and the MRP supertree was generated using PAUP* version 4.0 (98).
We define a recombination breakpoint to be a boundary of a genetic region introduced by an LGT event and incorporated via recombination into a genome. If a recombination breakpoint is detected within the boundaries of a gene (here, represented by its annotated open reading frame) using our approach, we refer to that breakpoint as an observable recombination breakpoint (ORB) and classify the corresponding set of homologous genes as ORB+; these gene sets indicate lateral transfer of a fragment of one or more genes. In contrast, a gene set lacking a detectable internal recombination breakpoint is classified as ORB−. An ORB− gene set that has an incongruent phylogeny to a reference species tree indicates lateral transfer of the whole gene and possibly also of genomic sequence extending beyond that gene (see sections below). More detail on the classification of ORB+ and ORB− gene sets is given in the study by Chan et al. (15).
We used a two-phase strategy (17) for detecting recombination within each gene set. Three P value statistics—the maximal chi-squared value (71), the neighbor similarity score (52), and the pairwise homoplasy index, as implemented in PhiPack (12)—were first used to detect evidence of recombination events within the sequence sets based on discrepancies in phylogenetic signals. Datasets in which at least two of the three P values were ≤0.10 were considered as potentially showing evidence of fragmentary LGT. To those gene sets, we applied DualBrothers (72) to define the ORBs more precisely. The program was run with an MCMC chain length of 2,500,000 generations, with a burn-in phase of 500,000 generations. The size of the window around which existing change points are randomly moved during any update was set to 5. Peter Green's constant C, the proportion of time that the sampler spends on updating the parameters when the number of dimensions is fixed, was set to 0.25. The set of trees (the tree search space) considered by the DualBrothers MCMC sampler (i.e., the list of all possible phylogenetic trees that could be inferred from shorter segments within the sequence set) was determined separately for each gene set as follows. Partitions (windows) were progressively stepped across each alignment (window length = 100, step size = 50; unit in alignment position) and the Bayesian phylogenetic inference program MrBayes (89), as described in the previous section, was used to infer unrooted phylogenetic trees at each window position. Inferred trees within the threshold at a Bayesian confidence interval of 90% were included in the initial tree list, with the maximum number of trees included set to 1,000. Gene sets containing ORBs are inferred to have undergone one or more within-gene transfer events. (See Method S1 in reference 18 for more information regarding the generation of tree search space for the implementation of DualBrothers.)
For each gene set for which no evidence of recombination was found in the first-phase screening, and for those positive in the first-phase screening but for which no recombination breakpoint was found by DualBrothers, we inferred a Bayesian phylogenetic tree and compared its topology against that of the species reference supertree using a multiple-test approach, as described in reference 15, incorporating the statistical tests of Shimodaira and Hasegawa (93), Goldman et al. (37), and Kishino and Hasegawa (57), as well as the expected likelihood weights (97), all as implemented in Tree-Puzzle 5.1 (92). Whole-gene (nonfragmentary) genetic transfer was inferred if any topology was rejected by two or more tests (P ≤ 0.05), indicating discordance with the reference topology. The approach for analysis of genetic transfer in gene sets that contain one or more synologs was adopted from an earlier study (64). Each gene set containing m genes from a total of n distinct genomes was categorized into three separate classes: (i) those with no synolog, m = n; (ii) those with one or two synologs, n < m ≤ n + 2; and (iii) those with more than two synologs, m > n + 2. For class i gene sets with no synologs, topological discordance between a gene tree and the reference supertree implies a whole-gene transfer via LGT. For each gene set in class ii, two or four new alignment sets were generated by removing replicate synologs in all combinations, followed by tree inference and topological comparison. Discordance between one or more of these gene trees and the reference supertree implies acquisition of a synolog in the gene set via LGT. For class iii gene sets with more than two synologs, the generation and analysis of dereplicated trees in all possible combinations quickly becomes impractical. For these sets, we compared the gene tree directly to the reference supertree; topological discordance suggests whole-gene recombination and might be explained by one or more events (see Results for details).
Individual gene-set trees were inferred from DNA alignments using MrBayes (89) with the parameters described above. The topology of each gene set was compared against that of the reference supertree using the multiple-test approach as described above and in reference 15. Such discordance was taken as prima facie evidence of whole-gene (nonfragmentary) genetic transfer.
Functional analysis of gene sets was based on role identifiers (Mainrole) as retrieved from the comprehensive microbial resource at The J. Craig Venter Institute (JCVI) website (http://cmr.jcvi.org/). Over- or under-representation of functional categories was based on the probability of observing a defined number of target groups (or categories) in a subsample, given a process of sampling without replacement from the whole data set (as defined in each case [see the text]) under a hypergeometric distribution (53). The probability of observing x number of a particular target category is given by the following equation:
where N is the total population size, m is the size of the target category within the population, n is the total size of the subsample, and k is the size of the target category within the subsample. The function for individual gene set was annotated based on sequence similarity matches (BLASTP; e-value ≤ 10−5) against the NCBI nr protein database, after which the annotations were extracted from the Gene Ontology database (http://www.geneontology.org/) using Blast2GO (38).
Protein domain and boundary information for each protein in the data set (N = 13,297) was determined by sequence similarity search against domain entries (type = “domain”) in Pfam version 20 (32) and SCOP version 1.69 (1). The breakpoint-to-boundary distance ρ follows the definition in our previously published work (see Fig. 1B in reference 18) as the normalized distance from an inferred recombination breakpoint (i.e., the ORB) to the midpoint of the nearest annotated domain boundary and ranges between 0 and 1. A ρ value of ≈0 indicates that the ORB is located at the midpoint of a domain-coding gene region (“domon”; see reference 18), i.e., domon disruption by the ORB, whereas ρ ≈ 1 indicates the ORB is located at or outside the domon boundary (i.e., no domon disruption by the ORB). To examine potential tendency of domon preservation during LGT (observed as large ρ values), we subsampled the data set randomly 10,000 times (omitting ρ = 1, size of each subsample = 50; to adjust for inference bias due to large sample size) and compared each subsample to a uniform distribution of (0, 1) using the Kolmogorov-Smirnov test (70).
For each of the 34,066 proteins in our data set, we searched for the most-similar sequence (or sequences, where more than one sequence is equally most similar) in other staphylococcal isolates using BLASTP (13) against a comprehensive database comprising 212,808 protein sequences from all available 81 genomes of Staphylococcus spp. (19 complete and 62 draft genomes as of 15 March 2011 in NCBI GenBank: see Table S1 in the supplemental material). Only top hit(s) with an e-value of ≤10−100 were counted. We use the term “genome affinity” to describe the relative recency of divergence (whether via vertical descent or LGT), measured as the number of matches between genomes from the same or different species, or for S. aureus the same or different CCs or species.
We adopted the multilocus sequence typing (MLST) approach to assign CC for each of the 63 S. aureus isolates in the protein database (see Table S1 in the supplemental material). For each isolate, we identified (where possible) the housekeeping gene sequences encoding carbamate kinase (arcC), shikimate dehydrogenase (aroE), glycerol kinase (glpF), guanylate kinase (gmk), phosphate acetyltransferase (pta), triosephosphate isomerase (tpi), and acetyl coenzyme A acetyltransferase (yqiL). The seven-gene allelic profile and hence the sequence type (ST) for each S. aureus isolate were identified using BLASTN (13) (e-value ≤ 10−100) against all gene alleles available from http://saureus.mlst.net/ (217 arcC alleles, 289 aroE alleles, 259 glpF alleles, 156 gmk alleles, 217 pta alleles, 225 tpi alleles, and 223 yqiL alleles as of 18 March 2011). The CC for each of the distinct STs was assigned using eBURST v3 (29) at default settings (http://saureus.mlst.net/). See File S1 in the supplemental material for the complete allelic profiles and STs for the 63 S. aureus isolates, the distinct STs used for eBURST analysis, and the resulting output.
Of the initial 2,924 clustered protein sets, 77 (2.6%) have a size of >52 and on that basis were excluded from analysis of LGT. Many of these large sets consist of proteins related to transport functions; the largest is composed of 1,777 proteins (constituting 5.2% of all annotated protein-coding genes annotated in Staphylococcus) related to the ATP-binding cassette (ABC) transporter. ABC transporters constitute one of the largest sets more generally in prokaryotes (43). The second-largest set is a 769-member cluster of proteins related to the phosphotransferase system (PTS) that are involved in sugar phosphorylation and regulation of metabolic processes (90). To analyze modularity of genetic transfer (at gene instead of protein level) in these genomes, we recovered protein-coding gene sets that correspond to the protein sets. By imposing a size restriction (4 ≤ N ≤ 52) and removing identical sequences (see Materials and Methods), we retained a total of 1,354 protein-coding gene sets in the subsequent analysis (4 ≤ N ≤ 41; no gene set fell in the range 42 ≤ N ≤ 52). The representation of each genome among the 1,354 gene sets and the size distribution for these sets are shown in Table 1 (far-right column) and Fig. 1, respectively. The most virulent species, S. aureus, is well represented, with all but two S. aureus genomes represented in over 1,200 gene sets. In total, for 229 (15.7%) gene sets, N = 8, and for 197 (14.5%), N = 9. The largest set consists of 41 sequences that encode putative nucleotidase proteins, while the second-largest set has 39 putative proteins of guanosine- or inositol-monophosphate dehydrogenase.
We inferred genetic transfer differently in single- versus multicopy gene sets. Single-copy (SC) gene sets contain no within-genome duplicated genes, i.e., no genome is represented more than once in such a set. In the absence of evidence to the contrary, the most parsimonious interpretation is that these genes share an evolutionary origin that predates the divergence of these strains, i.e., that these genes are orthologs. On the other hand, multicopy (MC) gene sets contain one or more within-genome duplicates, with at least one genome represented more than once in such a set. The additional gene copy (it is neither necessary, nor indeed usually possible, to distinguish the “original” from the “duplicate”) must either have arisen via a within-genome duplication event (i.e., be a paralog) or have been imported via LGT into the genome (i.e., be a xenolog). The presence or absence of these synologs necessarily affects how we interpret evidence of genetic transfer, as discussed below.
We interpret the evidence of phylogenetic discrepancies (manifested as topological discordance) and the inference of one or more observed recombination breakpoints, i.e., ORB(s), within the boundaries of the gene set as within-gene (fragmentary) genetic transfer (i.e., instances of ORB+; see reference 15). Table 2 shows the possible interpretations of fragmentary transfer in SC and MC gene sets. For those sets in which we find no ORB (instances of ORB−), we infer that there has been no fragmentary genetic transfer (SC-noFragGT and MC-noFragGT); in the latter case (MC), the provenance of the synologs is unclear. Inference of ORBs within SC gene sets (SC-FragGT) can be interpreted (again, in the absence of evidence to the contrary) as LGT between orthologs. In MC gene sets, however, the situation is more complex: (i) if LGT is inferred within the set but topological incongruence with the reference tree is restricted to single-copy genes (i.e., recombination has not affected the synologs), then all synologs are paralogs (MC-FragGT-P), and (ii) where both recombinant (xenologous) and nonrecombinant (paralogous) regions of synologs from the same genome are present in a gene set (the recombinant synologs are more precisely paralog-xenolog chimeras), we denote this situation in the set by MC-FragGT-P/X. Where per-genome copy number is small, it is reasonable to assume (in the absence of evidence to the contrary) that the LGT event occurred subsequently to the within-genome duplication. If, however, (iii) all synologs are detected as recombinant, we know that each is a xenolog (again, more precisely, a paralog-xenolog chimera), but it may be impossible to reconstruct the precise order of within-genome duplication, LGT and perhaps other (e.g., gene conversion or lineage-sorting) events that have produced this situation, which we denote as MC-FragGT-PX.
The interpretation is much the same in the case of whole-gene transfer, as shown in Table 3 for SC and MC gene sets, although with a complication (discussed below) that arises from our methodological approach. In the case of SC gene sets, both alternatives (SC-noWholeGT and SC-WholeGT) exactly parallel those for within-gene transfer (Table 2), with whole-gene transfer in SC gene sets again interpreted as involving orthologs. In MC gene sets, the situation is again more complex: (i) if LGT is inferred within the set but the synologs are not implicated, then all synologs are paralogs (MC-WholeGT-P), but (ii) where some but not all of the synologs are recombinant, the nonrecombinant synologs are native paralogs but the recombinants are xenologs (not chimeras, since these genes have been transferred in their entirety), and we denote this situation in the set by MC-WholeGT-P/X with the same qualification as in the case of within-gene transfer. If (iii) all synologs are recombinant, we infer that each is a xenolog throughout its coding region, although as before it may be impossible to reconstruct the precise order of within-genome duplication, LGT and perhaps other events that have produced this situation, which we denote as MC-WholeGT-PX.
The complication, mentioned above, concerns the evidentiary basis on which we identify none, some, or all synologs as recombinant via whole-gene transfer. In our approach, we inferred an ORB as a point at which phylogenetic trees inferred for the sequence regions to the immediate left and right are in topological disagreement. These trees can be recovered and compared to the reference supertree, making it straightforward to identify recombinant region(s) within a gene. Where no ORB can be identified, it is necessary to infer trees in a separate step for comparison against the reference topology. As a general rule we carried this out as just described, i.e., by using the synologous gene set as input into the Bayesian inference software MrBayes (89). Two drawbacks of this approach are the resource (CPU and memory) demands associated with inference from large datasets, and the complexity of extracting and interpreting topology data for all minimal subtrees that include synologs, particularly when taking into account the relative support for relevant bipartitions. We realized that there is an opportunity to address both of these drawbacks for the subset of sets (ca. 40% [see below]) that contain only a few (in the present case, one or two) synologs. For each such gene set, we dereplicated synologs in all combinations, yielding a set of alignments in which each genome is represented only once (see Fig. 2 for an example). Then, from each alignment we inferred a Bayesian phylogenetic tree. Since each gene set contains either one or two synologs, for each we generate two to four trees, each of which we compare separately against the reference supertree. In this way we easily automated the extraction of the relevant congruence relationship. This approach, adapted from that of an earlier study (64), could in principle be extended to greater numbers of synologs per set, although at exponential cost unless additional filtering steps are implemented.
We applied a two-phase strategy (17) to detect recombination in each of the 1,354 gene sets, the first phase involving three statistical tests for inferring phylogenetic discrepancies and the second involving a Bayesian approach to more-accurately identify ORBs (see Materials and Methods). Of the 1,354 gene sets, we found 401 (29.6%) that show evidence of within-gene recombination (i.e., sets containing one or more ORBs) after first-phase screening for phylogenetic discrepancies. Of these sets, 252 (18.6% of 1,354) also showed clear evidence of recombination based on Bayesian phylogenetic analysis in the second phase, with Bayesian posterior probability (BPP) support of ≥0.500 for the dominant topology (as determined internally with respect to the individual alignment) on at least one side of the inferred breakpoint in each set. These are the ORB+ gene sets. For 68 sets (5.0%), we could identify an ORB, but no sequence region has a BPP of ≥0.500; we labeled these as inconclusive. For a further 81 gene sets (6.0%), recombination was detected in the first phase, but no ORB could be identified in the second phase (i.e., these are the false positives from the first phase). No evidence of recombination was detected in 953 (70.4%) gene sets in first-phase screening.
Of the 252 ORB+ gene sets that show clear evidence of within-gene (fragmentary) genetic transfer, 98 are single-copy gene sets (SC-FragGT), while the other 154 are multicopy gene sets (MC-FragGT). Phylogenetic discrepancy at different regions across a gene set alignment can sometimes be due to regional (e.g., domain-specific) differences in rates of nucleotide substitution in one or more sequences. We looked specifically for such differential rates (μ ≥ 0.30, from DualBrothers) but did not observe any instances that span the inferred breakpoints. This suggests that the breakpoints inferred here indeed arise from genetic recombination and are not artifacts of variation in the rate of nucleotide substitution.
To examine possible functional bias pertaining to fragmentary genetic transfer within the 252 sets, we used annotations from the JCVI Comprehensive Microbial Resource (http://cmr.jcvi.org/) to assign a functional category (Mainrole) to each gene set. Figure 3 shows the proportions of proteins in each functional category for (Fig. 3a) single-copy (SC-FragGT) and (Fig. 3b) multicopy (MC-FragGT) gene sets for which we inferred fragmentary genetic transfer, compared to their frequencies in the full (1,354-set) staphylococcal data set.
Gene sets affected by fragmentary genetic transfer are significantly either over- or under-represented in more than half of the JCVI functional categories, and this is the case for both SC and MC gene sets. Gene sets affected by fragmentary transfer are significantly over-represented, compared to expectation, in various categories (energy metabolism; DNA metabolism; protein fate; amino acid biosynthesis; purines, pyrimidines, nucleosides, and nucleotides; and transcription) for both SC-FragGT and MC-FragGT. On the other hand, both types of gene sets are significantly under-represented in the categories hypothetical protein, transport and binding proteins, and regulatory functions. The category protein synthesis is over-represented in SC-FragGT but under-represented in MC-FragGT, while the unknown function category is biased in the opposite direction.
Of the 1,354 staphylococcal gene sets under consideration, we have thus far inferred within-gene transfer for 320 (252 as clear instances of LGT and a further 68 for which evidence was deemed inconclusive). We now turn to the 1,034 remaining gene sets: 953 recombination-negative plus 81 false-positive cases resulting from our two-phase approach for detecting within-gene transfer. For each we inferred a Bayesian phylogenetic tree and compared it with the reference MRP supertree generated from 2,645 putatively orthologous sets in these 13 genomes (see Materials and Methods). The MRP supertree is shown in Fig. 4 a. Four Staphylococcus species are represented among the 13 genomes, each monophyletic (S. saprophyticus and S. haemolyticus trivially so) according to our supertree analysis. Using the MLST technique, we grouped the nine S. aureus isolates into five distinctive CCs: (i) the strains MW2 and MSSA476 in CC15, (ii) COL, USA300 and NCTC8325 in CC8, (iii) N315 and Mu50 in CC5, (iv) MRSA252 in CC30, and (v) RF122 in CC705. In our analysis, STs previously identified as CC1, i.e., those of MW2 and MSSA476, were grouped as CC15, likely due to the increased frequency of ST15 in the current database (http://saureus.mlst.net/); the naming convention of CC is based on the predicted ST founder, i.e., the most frequent ST (ST15 instead of ST1) found within the group. The supertree is rooted using S. saprophyticus as outgroup, based on a previous phylogenetic study of Staphylococcus species using small subunit rRNA genes (99). Whole-gene transfer was inferred in a gene set if the tree topology was significantly discordant with that of the reference supertree (see Materials and Methods).
The 1,034 gene sets can be divided into three classes based on the number of synologs present in each set. The number of gene sets with trees concordant or discordant with the reference supertree for each category is shown in Table 4. Among the 774 of these sets in SC, 97 (13%) show topological discordance and represent LGT involving orthologs (SC-WholeGT), whereas no evidence of LGT was found among the other 677 (87%), i.e., SC-noWholeGT. Among the 260 of these sets in MC, however, the outcome was very different: 210 (81%) are discordant (MC-WholeGT) vis-à-vis the reference topology and only 50 (19%) concordant (MC-noWholeGT). Of these 260 sets, 105 contain one or two synologs each, and 155 contain more than two. Of the 105 sets in the former group, 58 (55%) revealed evidence of whole-gene transfer, whereas, remarkably, among the 155 in the latter group, fully 152 (98%) did so. Using our synolog dereplication approach (Fig. 2), we could further classify the 58 MC-WholeGT sets with one or two synologs into 12 for which all synologs are paralogs but not xenologs (MC-WholeGT-P), 22 for which synologs are either paralogs or xenologs but not both (MC-WholeGT-P/X), and 19 with more-complex histories (MC-WholeGT-PX). The remaining five were reduced to gene sets with sizes of <4 upon our dereplication approach and thus do not provide meaningful topology comparison under this classification. Examples of tree topologies for each of these instances are shown in Fig. 4b through Fig. 4d; for each tree, a homologous gene copy from Bacillus cereus or B. thuringiensis was used as outgroup. Figure 4b shows the topology for staphylococcal genes encoding the function annotated as glyoxalase/bleomycin resistance protein/dioxygenases. In comparison to the reference supertree in Fig. 4a, the intraspecies gene sharing among lineages of S. aureus (not any of the gene copies in S. saprophyticus) contributed to the topological incongruence between the two trees, i.e., both the S. saprophyticus synologs are paralogs (an instance of MC-WholeGT-P). On the other hand, gene copies of S. saprophyticus encoding glucosamine-6-phosphate isomerase (Fig. 4c) show a different pattern (an instance of MC-WholeGT-P/X), in which a copy shows evidence of gene sharing with lineages of S. epidermidis or S. aureus (i.e., is a xenolog), whereas another has likely arisen by gene duplication (i.e., is a paralog). For the gene set encoding phage transcriptional activator (Fig. 4d), an important protein implicated in phage-mediated transduction, the evolutionary history of both S. aureus NCTC8325 gene copies could have involved duplication and/or LGT, of which the order of the events could not be determined using this approach (MC-WholeGT-PX).
Figure 5 shows the proportions of proteins in each JCVI functional category within (Fig. 5a) single-copy (SC-WholeGT) and (Fig. 5b) multicopy (MC-WholeGT) gene sets for which we inferred whole-gene transfer, compared to their frequencies in the full staphylococcal data set. The proteins in these sets are significantly either over- or under-represented in more than half of the JCVI categories, although with more-numerous and greater differences between SC and MC gene sets than was observed for fragmentary transfer. SC sets affected by whole-gene transfer are significantly over-represented, compared to expectation, in the protein synthesis, central intermediary metabolism, and transcription categories (in contrast to under-representation of these same categories in MC sets), suggesting that Staphylococcus genome lineages appear to have been more receptive to introgression of whole genes that encode for these protein functions when no indigenous copy was already present, or if an indigenous copy was present it was replaced or subsequently lost.
Correspondingly, gene sets annotated with these same functional categories are significantly under-represented in MC-WholeGT, suggesting that these genomes have been less receptive to the integration of genes encoding these functions when multiple copies already exist. The category cellular processes is over-represented in SC-WholeGT, particularly the functions involved in toxin production and resistance (P = 1.7 × 10−4), while the categories implicated in metabolic functions, including energy metabolism, are under-represented. The categories cell envelope and mobile and extrachromosomal genetic element functions are over-represented in MC-WholeGT. The categories fatty acid and phospholipid metabolism and signal transduction are under-represented in both SC-WholeGT and MC-WholeGT.
A summary of the results for all 1,354 gene sets used here, including their putative functional annotations, is available in File S2 in the supplemental material. The (protein and nucleotide) alignments and the Bayesian phylogenies for all 1,354 gene sets generated from the present study are available at http://bioinformatics.org.au/staphGT/. Of 559 gene sets that show evidence of LGT (both FragGT and WholeGT), 146 (26.1%) were not annotated with a function, while 277 (49.5%) have functions related to enzymatic reactions, e.g., kinases, isomerases, and transferases. The majority of genes related to resistance to antibiotics, drugs or heavy metals (11 of 15) are found to have undergone LGT; seven of these are in WholeGT, including genes conferring resistance to penicillin, methicillin, and heavy metals, including aluminum and arsenic (see Table S2 in the supplemental material). Of the three gene sets with annotated functions related to bacterial toxins, we found evidence of whole-gene transfer in the gene set encoding Txe family addiction module toxin but not in the other two, exfoliative and MazF toxins. In another group are gene sets showing evidence of LGT and annotated with functions related to membrane transport (39 of 59; see Table S3 in the supplemental material), especially cation transport, e.g., subunits of the monovalent cation/proton antiporter proteins.
Interestingly, among the 60 sets of ribosomal proteins in the 1,354 data set (and their related subunits, which are expected to be highly conserved), we found evidence of FragGT in six gene sets and WholeGT in another four (two of which are related to methyltransferases of rRNA small subunits). This is in addition to LGT evidence in 14 of 27 gene sets (9 of which are in WholeGT) that are related to transcriptional regulatory functions, e.g., transcription activators and elongation factors.
Within each ORB+ gene set, we further sought to determine whether the transferred coding regions correlate with protein structural domains. Using domain information available from the Pfam (32) and SCOP (1) databases, we introduced the ρ statistic to examine the potential tendency of LGT to disrupt domain-coding gene region, i.e., domon (see reference 18 and Materials and Methods). A ρ value of ≈1 indicates that the ORB is located at or outside the domon boundary. If there is a strong correlation between domon boundaries and ORB, the observed ρ values would be significantly larger than values uniformly distributed at random. We compared the distribution of observed ρ values against a uniform distribution (between 0 and 1), taking potential bias in large sample sizes into account via a random subsampling approach (10,000 subsamples of 50 values each, see Materials and Methods). Figure 6 shows the results of this analysis for SC gene sets (Fig. 6a), i.e., SC-FragGT and MC gene sets (Fig. 6b), i.e., MC-FragGT. For each, we show the observed distribution of ρ values (subpanel i) and, across the 10,000 Kolmogorov-Smirnov tests between an observed and an expected distribution, the distribution of D (subpanel ii) and P (subpanel iii) values. The D values indicate the magnitude of difference between the observed and expected distributions, whereas the P values indicate the statistical significance of such difference. Both of these values range between 0 and 1. In this instance, small P values indicate that the observed ρ values are larger than expected under a uniform distribution.
For the cases of SC-FragGT and MC-FragGT, any bias in the observed distribution of ρ values is unclear, as shown in panels i in Fig. 6. For SC-FragGT (Fig. 6a) we found small differences between observed and expected ρ values (Kolmogorov-Smirnov test, mean D value = 0.16; ca. 75% of the P values > 0.1). A similar trend was observed in MC-FragGT (Fig. 6b), with most (ca. 92%) of the observed P values > 0.1 (mean D value = 0.11).
We identified 37 gene sets among the 1,354 that encode functions known to be present in staphylococcal mobile genetic elements (MGEs) (see Table S4 in the supplemental material) as reported in previous studies (66, 69). Among these 37 gene sets, 31 (83.8%) show clear evidence of LGT (18 instances of WholeGT and 11 instances of FragGT); for 2 the evidence was inconclusive, and for 6 we found no evidence of LGT. Although functions related to MGE are significantly (P ≤ 0.01) over-represented in MC-WholeGT (Fig. 5b) and most MGE genes show high susceptibility to LGT, most of the gene sets we identified as affected by LGT are annotated with functions that extend beyond virulence or resistance to include a broad scope of central metabolism, transcription, and translation (289 instances in WholeGT and 241 instances in FragGT; see File S2 in the supplemental material).
Figure 7 shows the genome affinity for each of the 13 staphylococcal genomes, assessed by the best BLAST matches (most-similar protein sequences) for each of these 34,066 proteins in other staphylococcal isolates/species (see also Table S5 in the supplemental material). Most proteins (>85% across all genomes) from S. aureus and S. epidermidis find a best match or matches within the same species, suggesting high level of intraspecies genome affinity. S. haemolyticus and S. saprophyticus could not be evaluated in this way, since they are lone representatives of their respective named species among these complete and draft genomes. For S. aureus, we further examined whether these proteins are most similar to sequences from other isolates in the same CC or in another CC. Across the nine S. aureus genomes (constituting five different CCs), most proteins find their best matches outside their own CC: 57.9% (of total proteins in a genome) on average between the two CC5 isolates, 62.7% on average among the three CC8 isolates, 67.6% on average between the two CC15 isolates, and 77.3 and 96.5% for each of the genomes representing CC30 and CC705, respectively (see Table S5 in the supplemental material). This suggests a high level of inter-CC genome affinity among S. aureus isolates. CC5, CC8, and CC15 mutually share high genome affinities, since each is among the three CCs most similar to each of the other two (see Table S5 in the supplemental material). Genomes of CC30 are highly similar to those of CC30, CC42, and CC8, and those of CC705 are highly similar to those of CC8, CC10, and CC5, although for CC30 and CC705 in particular these affinities may represent rough approximations due to the limited data currently available.
Two-thirds of the strains of S. aureus in the present study (all except MSSA476, NCTC8325, and RF122) are methicillin resistant. According to the MRP supertree based on sets of protein-coding genes from the 13 genomes examined here (Fig. 4a), the methicillin-susceptible isolates nest within different clades in the supertree admixed with resistant isolates, implying multiple origins of methicillin resistance in S. aureus (20).
Among 1,354 sets of homologous genes, we found clear evidence of LGT in 368 gene sets (252 ORB+ sets in SC-FragGT and MC-FragGT, 97 sets in SC-WholeGT, and 19 sets in MC-WholeGT-PX, totaling 27.1% of 1,354). The remaining 191 sets in MC-WholeGT, together with the 68 inconclusive cases from analysis of within-gene transfer (totaling 259; 19.1% of 1,354), represent probable LGT, although the exact origins of synologs in these instances could not be determined. Leaving aside the 68 inconclusive cases, within-gene (252; 18.6%) and whole-gene (307; 22.7%) transfer contribute almost equally to the discordance of these gene sets against the reference species phylogeny. We observe a higher frequency of inferred genetic transfer involving multiple (MC) than single-copy (SC) gene sets, some of which can be explained by LGT alone, but most of which appears to reflect more-complex evolutionary histories involving multiple LGT and gene duplication events, gene conversion, and/or lineage sorting. On the other hand, no phylogenetic approach can detect recombination between genomic lineages that are terminal and adjacent on a given gene-set tree, suggesting that our estimates of recombination frequency are probably low.
LGT and gene duplication have been proposed to be the major factors contributing to functional innovation in prokaryotes (68, 103). Hooper and Berg (46) proposed that duplication may be more common among laterally transferred genes than among indigenous ones. Although our study was not explicitly designed to test this hypothesis, based on the results presented here we cannot reject this assertion, since we found evidence of LGT among a larger proportion of MC (364/433; 84.1%) than SC gene sets (195/921; 21.2%); see File S2 in the supplemental material. In addition, of the 11 gene sets related to antibiotic/drugs or heavy metal resistance that show evidence of LGT (see Table S2 in the supplemental material), 9 (81.8%) are MC. In contrast, gene sets conferring similar functions that show no evidence of LGT are mostly SC (four of five). Our observations suggest that both gene duplication and LGT are important in the maintenance of the resistance mechanisms to toxic drugs or heavy metals in Staphylococcus. This finding is in agreement with a recent study that demonstrates gene amplification as a key factor in reducing fitness costs associated with antibiotic resistance in bacteria (76). Assuming that maintenance of resistance mechanisms is deleterious in the absence of target molecules (e.g., antibiotics, drugs, or heavy metals), the dosage compensation of these (largely externally acquired) narrow-function genes by duplication is likely a regulatory response to maintain resistance (91).
Gene sets for which we infer LGT are represented at frequencies significantly different from random expectation in more than half of the JCVI functional role categories. The difference in the patterns of over- and under-representation between SC and MC gene sets is greater for whole-gene than for fragmentary genetic transfer. Gene sets involved in protein synthesis and affected by LGT, whether within-gene or whole-gene, are highly over-represented among SC sets but under-represented among MC sets, suggesting that these Staphylococcus genomes are more susceptible to introgression via LGT of genes related to amino acid synthesis when no similar copy is already present in the recipient genome than when multiple copies already exist. For whole-gene transfer, we likewise found significant bias in favor of SC sets in the frequency of protein sets engaged in toxin production and resistance. Genes related to toxin-antitoxin systems have recently been implicated in LGT between Staphylococcus and Listeria (19). The system is thought to be a selective mechanism for postsegregational killing of daughter cells that lack the genes, usually by cleaving or modifying nucleotides or ribosomes (101), e.g., the control of endoribonuclease activity by the well-studied MazFE system (25). Both genes encoding MazF toxin and MazE antitoxin are recovered as SC gene sets in our 1,354-set data. These two proteins are usually present together, forming a linear heterodimer structure (35, 54). We found evidence of whole-gene transfer implicating the MazF toxin in Staphylococcus, but not in the corresponding antitoxin MazE, suggesting different evolutionary histories between the two and that the antitoxin MazE is largely inherited vertically among the genus. Alternatively, highly similar (>90% nucleotide identity) sequences of MazE in Staphylococcus (e.g., due to convergence) could have obscured the LGT signal from detection using our approach.
MGEs have been reported to be the key agents of LGT that contribute to the rapid spread of virulence and antibiotic resistance in Staphylococcus (9, 65, 66, 69). Our findings, based on a subset of 1,354 gene sets, demonstrate clear evidence of LGT among staphylococci that extends beyond functions related to MGE and virulence. We found LGT to be more frequent among different CCs within S. aureus than between different staphylococcal species. Interestingly, of seven housekeeping genes represented in our 1,354 gene sets and commonly used for MLST analysis, three (glycerol kinase [glpF], phosphate acetyltransferase [pta], and triosephosphate isomerase [tpi]) show clear evidence of fragmentary transfer, implicating LGT even in the most conserved genes in the genus and again pointing to a footprint of LGT that extends much farther beyond MGE functions than previously appreciated (19, 96). This level of mutual gene sharing identifies genus Staphylococcus as a genetic exchange community (95), within which barriers against LGT appear to be low.
The complexity hypothesis (51) postulates that genes involved in functions related to complex regulatory networks such as transcription and translation (i.e., genes encoding informational proteins) are less likely to undergo LGT compared to genes encoding operational (e.g., metabolic) functions. Our results do not speak directly to this hypothesis but, interestingly, show that some of the most conserved genes that encode informational proteins (ribosomal proteins and transcription factors) are susceptible to both within-gene and whole-gene LGT in Staphylococcus. The introgression of these genetic fragments into the bacterial genome could be an important repair process to preserve gene function.
Almost all previous approaches to quantifying LGT in prokaryotes, including all those based on a phylogenetic approach (8, 63, 64, 85, 108), have been based on the assumption that the unit of genetic transfer is necessarily an entire (full-length) gene. Here, for within-gene genetic transfer in Staphylococcus genomes, we found only a very weak correlation between inferred breakpoints and domon boundaries. This observation is in agreement with a recent study (18) in which domons were found to not have been preferentially preserved intact during LGT among prokaryotes more broadly. This Staphylococcus data set, or indeed any within which homologous genes have diverged relatively little, may not be ideal for the phylogenetic exploration of possible relationships between recombined regions and domons, since the signal of recombination (or LGT) would not have been immediately obvious. Highly similar sequences are unlikely to be disruptive of protein domain structure, even when the recombination breakpoint is internal to a domon, and breakpoints become increasingly difficult to locate precisely at high similarity between introgressed and native sequences, as characteristic differences necessarily become less frequent.
Both LGT (whether of partial or complete genes) and genetic duplication shape the functional evolution of protein families in eukaryotes (3). Our results demonstrate that both processes have contributed to gene diversity and functional innovation among staphylococcal genomes. Indeed, since our approach cannot detect LGT between immediate sister genomes at the tips of the species tree (e.g., between lineages of highly similar S. aureus CCs), our results almost certainly underestimate the extent of genetic transfer in these genomes. Other limitations of the phylogenetic approach in delineating LGT have been described in detail elsewhere (5, 16, 59, 88). We have also identified a substantial class of genes for which there is evidence of a complicated evolutionary history that includes multiple events of genetic transfer and/or genetic duplication, e.g., recombination between two synologs or duplication of synologs. These results significantly enhance our understanding of genome evolution, particularly genetic transfer, in the genus Staphylococcus, including virulently pathogenic isolates.
This study was supported by Australian Research Council grant CE0348221.
We thank Aaron Darling and Vladimir Minin for valuable advice on the use of DualBrothers.
†Supplemental material for this article may be found at http://jb.asm.org/.
Published ahead of print on 27 May 2011.