FEBS Lett. Author manuscript; available in PMC 2010 April 17.
Published in final edited form as:
PMCID: PMC2677416

Pathogenesis gene families in the common minimal genome of Staphylococcus aureus are hypervariable


Staphylococcus aureus is a versatile pathogen that shows high levels of inter-strain genetic variability and positive evolution in certain pathogenesis-related genes. Apart from gene content differences, variability in shared genes may affect pathogenicity. Studying such variability requires that the common minimal genome (CMG) be identified. In this study we have surveyed the CMG of S. aureus with respect to variability amongst orthologous family members, and determined that genes involved in pathogenesis preferentially accumulate variations. A negative correlation between variability of genes and their evolution was found, suggesting a preservation of host-specific function while exhibiting sequence diversity. Variation in key pathogenesis genes in S. aureus might predispose them to functional modulation, thereby playing an important role in evasion of host immunity.

Keywords: orthologous variation, common minimal genome


Genome sequencing and analysis affords an opportunity to study the fundamental properties and changes that distinguish bacteria of different species or strains. In many bacteria, it has been shown that strain specific gene content differences lead to pathogenesis (Escherichia coli O157:H7 [1], Streptococcus spp [2, 3]), host specialization (Salmonella enterica serovar Typhi [4]), drug resistance (Staphylococcus aureus VRSA [5]), and modulation of host immune responses (S. aureus D30 and 930918-3 [6]). In S. aureus, availability of several genomes (14 human specific strains and 1 bovine pathogenic strain [7]) has enabled the study of genomic variation, both at the gene level and at the nucleotide level.

To study the gene level variation in a given species, it is necessary to identify a set of genes common to all strains [8]. This set, known as the common minimal genome (CMG) or the core genome, defines the set of biological processes that are shared in all strains of the species, and hence the species itself. On the other hand, genes borne on mobile genetic elements like phages or transposons constitute the variable genome. The variable genome of S. aureus encodes genes for modulation of host immune responses [9] and for certain phenomena such as nasal carriage. However, major reported determinants of staphylococcal pathogenicity like the accessory gene regulator (Agr), clumping factors (clfA and clfB) and exotoxins are present in all S. aureus strains and thus are a part of the common minimal genome (CMG) suggesting a fundamental role for these genes in pathogenicity.

Even though differences in gene content explain major differences observed between strains, simple nucleotide level events like mutations, insertions, deletions, truncations and single nucleotide polymorphisms also play an important role in defining and modulating pathogenicity. In the bovine pathogenic strain of S. aureus ET3-S, the authors have shown positive evolution (dN/dS > 1) in several key pathogenesis genes, and argue that these genes have specifically evolved to adapt to a different host, the cow [10, 7]. In other pathogens like E. coli [11], and Xylella fastidiosa [12], positive evolution of genes has been attributed to adaptation to host type as well as enhancement in pathogenicity.

Positive evolution of genes has been measured using the Yang and Nielsen metric (dN/dS ratio) under the assumption that non-synonymous changes would lead to differences in functionality. It is worth noting that there can be a high degree of variation that does not manifest itself as a dN/dS ratio > 1. While positive evolution of genes and its effect on pathogenicity have been documented for several organisms including S. aureus, the role of gene level variations that do not lead to evolution of the species has been neglected. In this study, we have termed this phenomenon variation in the absence of evolution while analyzing the effect of variations in the CMG of S. aureus on the pathogenicity of the organism. We first defined the CMG of S. aureus based on all 14 completely sequenced human specific strains available at the time of this study, and demonstrated that variation and evolution are not correlated in these strains. Interestingly, we determined that pathogenesis-related genes in S. aureus preferentially accumulated variations in the absence (dN/dS < 1) of evolution. Such variations, some of which occur in key amino acid residues of host adhesion molecules (clumping factors), drug transporters (QacC) and regulatory proteins (MarR family protein), likely affect their activity. We hypothesize that such variation in pathogenesis-related genes enables the pathogen to modulate infectivity in response to the host.

Materials and Methods

Sequences and functional annotations for protein families

All sequences were obtained from the NCBI sequence database. The strains used in this study are N315 [13] (NC_002745), Mu50 (NC_002758), COL (NC_002951), MRSA252 (NC_002952), MSSA476 (NC_002953), MW2 (NC_003923), RF122 (NC_007622), USA300 (NC_007793), NCTC8325 (NC_007795), JH1 (NC_009632), JH9 (NC_009487), Mu3 (NC_009782) and USA300_TCH1516 (NC_010079). Available annotations for each genome were used, in conjunction with the clusters of orthologous groups (COG) classification [14] and manual curation for functional analysis. When multiple annotations were present, the most descriptive annotation was manually chosen to represent the function. It should be noted that COG does not have a family specific for pathogenicity. Hence, annotations for genes involved in pathogenicity were derived from published material for each such gene.

Sequence analysis for orthology, evolution and variation in gene families

We used bidirectional best hit BLAST (gapped Blast reference) to identify orthologs [14]. Since we were searching for a common group of orthologs in all the strains, the choice of reference strain was arbitrary; we used the first sequenced strain, N315, as the reference. All the genes in N315 that had orthologs in the other 13 human specific strains were considered a part of the common minimal genome (CMG). Protein sequences were obtained by translation of gene sequences. Multiple alignments were performed on each gene / protein family (containing 14 sequences each) using CLUSTALW [15].
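The bidirectional best hit criterion can be illustrated with a minimal Python sketch. It assumes that the BLAST results have already been parsed into (query, subject, bit score) tuples; the function names, the tuple layout and the toy gene identifiers are hypothetical and are not taken from the scripts used in this study.

```python
def best_hits(hits):
    """Map each query gene to its highest-scoring subject gene.

    `hits` is an iterable of (query, subject, bit_score) tuples, for example
    parsed from tabular BLAST output (the parsing step is not shown).
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _score) in best.items()}


def reciprocal_best_hits(ref_vs_other, other_vs_ref):
    """Return {reference_gene: ortholog} for pairs that are mutual best hits."""
    forward = best_hits(ref_vs_other)
    backward = best_hits(other_vs_ref)
    return {ref: other for ref, other in forward.items()
            if backward.get(other) == ref}


def common_minimal_genome(ref_genes, rbh_per_strain):
    """Keep the reference genes that have an ortholog in every other strain."""
    return sorted(gene for gene in ref_genes
                  if all(gene in rbh for rbh in rbh_per_strain))


# Toy example with made-up gene identifiers.
forward_hits = [("a1", "b1", 200.0), ("a1", "b2", 50.0),
                ("a2", "b2", 120.0), ("a3", "b1", 90.0)]
backward_hits = [("b1", "a1", 198.0), ("b2", "a2", 110.0), ("b2", "a1", 40.0)]
print(reciprocal_best_hits(forward_hits, backward_hits))
```

In this toy case gene a3 is dropped because its best hit, b1, points back to a1 rather than to a3.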

In order to study the evolution of genes, we used the PAML package [16]. Conversion of protein alignments to DNA alignments for use in PAML was performed using pal2nal [17]. These alignment files were then used to calculate dN/dS ratio using the Nei-Gojobori method [18]. Apart from calculating dN/dS ratio, we calculated the fraction of gaps and mismatches from the CLUSTAL alignments. The fraction was calculated as a ratio of number of gaps and mismatches in the alignments to the overall length of the alignment.
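The conversion of a protein alignment into a codon alignment can be sketched as follows. This is only an illustration of the idea, not the pal2nal implementation: the function name is hypothetical, and the sketch assumes that the coding sequence starts in frame and does not check that each codon actually translates to the aligned residue.

```python
def protein_to_codon_alignment(aligned_protein, cds):
    """Thread an ungapped coding sequence onto one gapped protein alignment row.

    Each residue consumes the next codon of `cds`; each '-' becomes '---'.
    Any trailing bases (such as a stop codon) are ignored.
    """
    n_residues = sum(1 for ch in aligned_protein if ch != "-")
    if len(cds) < 3 * n_residues:
        raise ValueError("coding sequence is shorter than the protein requires")
    codons = []
    pos = 0
    for ch in aligned_protein:
        if ch == "-":
            codons.append("---")
        else:
            codons.append(cds[pos:pos + 3])
            pos += 3
    return "".join(codons)


# Toy example: a three-column protein row with one gap.
print(protein_to_codon_alignment("M-K", "ATGAAATAA"))
```

Keeping gaps as whole codons preserves the reading frame, which is what a codon-based dN/dS calculation needs.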

We used the TMHMM software [19] for calculating transmembrane regions in the proteins. For the set of genes with truncations, we predicted functional domains by searching against PFam using hmmpfam [20]. All predictions and alignments were parsed using PERL scripts (supplementary methods).

Functional classification of genes used in the study

We used the COG classification system for this study [14]. The COG system classifies genes according to their function, in a hierarchical manner. There are four major groups of COGs consisting of a total of 25 specific subgroups. The major groups are: 1) information storage and processing, 2) cellular functions, 3) metabolism, and 4) poorly characterized functions. Information storage and processing (Inf) is comprised of genes involved in replication, transcription and translation along with those for chromatin structure and RNA processing. Cellular functions (Cel in this study) encompass genes involved in cell division, membrane biosynthesis, extracellular structures, signal transduction and protein processing. The group metabolism (Met) includes genes involved in all metabolic activities as well as genes that encode for metabolic transporters. Finally, those genes with a generic assigned function or no function at all were classified as poorly characterized (Hypothetical; Hyp in this study).
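As an illustration, the mapping from the 25 one-letter COG functional categories to the four major groups can be written as a small lookup table. The letter assignments follow the standard COG categories; the abbreviations Inf, Cel, Met and Hyp are those used in this study, while the function name and the handling of genes without any COG letter (treated here as Hyp) are assumptions made for the sketch.

```python
# One-letter COG functional categories grouped into the four major groups.
COG_MAJOR_GROUPS = {
    "Inf": "JAKLB",       # information storage and processing
    "Cel": "DYVTMNZWUO",  # cellular functions
    "Met": "CGEFHIPQ",    # metabolism
    "Hyp": "RS",          # poorly characterized
}

LETTER_TO_GROUP = {
    letter: group
    for group, letters in COG_MAJOR_GROUPS.items()
    for letter in letters
}


def major_groups(cog_letters):
    """Return the sorted major groups for a COG category string such as 'KT'.

    Genes with no recognized COG letter are reported as 'Hyp' (an assumption
    of this sketch).
    """
    groups = {LETTER_TO_GROUP[c] for c in cog_letters if c in LETTER_TO_GROUP}
    return sorted(groups) if groups else ["Hyp"]


print(major_groups("KT"))
```

A gene annotated with letters from two major groups is reported in both, so how such genes are counted in a distribution is a separate choice.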

Gene groups used in this study

For each orthologous gene family, we calculated the fraction of variation (Fv) as follows: Fv = No. of mismatches or gaps in alignments / length of alignment. We chose the top 10% of genes with variation (Fv >=0.11) and named it the high variability group. In the high variability group, we studied the effects of both truncations and insertion / deletion events on variability. To compare the gene content of the high variability group, we used a set of genes with low variability (Fv <= 0.01).
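A minimal Python sketch of the Fv calculation and the group assignment is given below. It assumes that a mismatch or gap is counted once per alignment column, that is, any column in which the aligned sequences are not all identical; the function names and the label "intermediate" are hypothetical. The thresholds 0.11 and 0.01 are the ones stated above.

```python
def fraction_of_variation(alignment):
    """Fv = variable columns / alignment length, for equal-length aligned rows."""
    if not alignment:
        raise ValueError("alignment is empty")
    length = len(alignment[0])
    if length == 0 or any(len(row) != length for row in alignment):
        raise ValueError("rows must be non-empty and of equal length")
    # A column is variable when it holds more than one distinct symbol,
    # which covers both mismatches and gaps.
    variable = sum(1 for column in zip(*alignment) if len(set(column)) > 1)
    return variable / length


def variability_group(fv, high=0.11, low=0.01):
    """Assign a family to the high or low variability group by its Fv."""
    if fv >= high:
        return "high"
    if fv <= low:
        return "low"
    return "intermediate"


# Toy alignment of three sequences: one mismatch column and one gap column.
toy = ["MKT-LV", "MKTALV", "MRTALV"]
fv = fraction_of_variation(toy)
print(fv, variability_group(fv))
```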

Statistical Analysis

We compared the distribution of COG families in each of the groups outlined above, using the chi-square test and the Fisher test. A P-value of 0.05 or less in the Fisher test was considered significant. We studied the correlation between dN/dS values and Fv values using Pearson’s correlation coefficient (PCC). For all these tests, we used the R statistical package.
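For readers who want to reproduce the correlation step outside R, Pearson's correlation coefficient can be computed as in the sketch below. The function name and the toy values are illustrative only and are not data from this study.

```python
import math


def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up Fv and dN/dS values for five hypothetical gene families.
fv_values = [0.01, 0.03, 0.08, 0.12, 0.20]
dnds_values = [0.40, 0.35, 0.30, 0.32, 0.10]
print(pearson(fv_values, dnds_values))
```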

Results and Discussions

The common minimal genome (CMG) of S. aureus was derived from 14 human-specific strains and consisted of 1888 genes (supplementary information 1). The COG distribution of the CMG reveals a large number of genes involved in metabolic activities, followed by cellular functions and hypothetical genes, in that order (Figure 1). In this study we analyzed the correlation between variation in the CMG and pathogenicity. In the following sections, we compared and contrasted the groups of genes showing high variability with those that show little or no variability.

Figure 1
Distribution of COG groups in the CMG of S. aureus

Pathogenesis related genes contain high levels of variation

Positive evolution (dN/dS > 1) of pathogenesis-related genes has been documented in bovine mastitis strains of S. aureus but not in the human isolates. In fact, our analysis of the CMG revealed that none of the known pathogenesis related genes had a dN/dS value of >1. However, we observed high levels of variability in many genes in the CMG. A scatter plot of Fv values versus the dN/dS values revealed a slightly negative correlation (PCC = −0.2138) between these two values for a given gene and its protein (Figure 2). Such correlation provides evidence for variation of genes in the absence of positive evolution, reflecting conservation of function with an altered / modulated activity for the gene product. In order to characterize genes exhibiting high-variability in the absence of positive evolution, we studied the top 10% of variable genes in the CMG. The choice of top 10% was statistically significant since the Fv cutoff value (Fv >= 0.11) was higher than Q3+1.5IQR (the sum of 3rd Quartile and 1.5 times the inter quartile range) of the distribution (data not shown). Apart from the overall functional composition of genes in the high variability group, we were also interested in identifying the processes that generate the variations. The high variability group consisted of 189 genes (top 10% of the CMG). A majority of them had truncations (117 genes) of which in 23 genes, the truncation extended into the proteins’ predicted functional domain (PFam prediction) producing putative pseudogenes. However, only two of these 23 genes (RbsU and MarR family regulatory protein) were associated with pathogenicity.
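The outlier criterion Q3+1.5IQR mentioned above can be sketched in Python as follows. The quartile method is an assumption: the "inclusive" method of the standard library matches the default quantile definition in R, but the definition actually used for the analysis is not stated. The function name and the toy values are hypothetical.

```python
import statistics


def upper_outlier_fence(values):
    """Return Q3 + 1.5 * IQR for a collection of numbers."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 + 1.5 * (q3 - q1)


# Toy values only: nine numbers with Q1 = 3 and Q3 = 7.
toy_values = [1, 2, 3, 4, 5, 6, 7, 8, 9]
fence = upper_outlier_fence(toy_values)
print(fence)  # 13.0

# A cutoff lying above the fence selects only values that this rule
# would flag as upper outliers.
print([v for v in toy_values + [15] if v > fence])  # [15]
```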

Figure 2
Scatter plot of dN/dS vs Fv reveals presence of high variation without positive evolution

There were 36 genes that had either an insertion or deletion and variations that accounted for more than 11% of the alignment length. The insertions and deletions in each gene family were contributed by different strains, eliminating the possibility of variation from one outlier strain (data not shown). The composition of genes with insertions and deletions was significantly different from that of the overall CMG. This sub-group also had a high fraction of pathogenesis-associated genes. Notable examples of pathogenesis related genes included superantigen-like proteins (exotoxins), accessory gene regulatory protein B (AgrB), clumping factor A, and the virulence protein EssC. These genes are central to staphylococcal pathogenicity and aid in establishing (superantigen-like proteins and EssC) and regulating (AgrB) pathogenicity.

Even outside the high variability group, insertions and deletions in certain genes were apparently biologically significant. A case in point is the QacA gene, which is involved in resistance to decouplers (phenolics) used extensively in the hospital environment. A previous study showed that hospital acquired MRSA have evolved strong resistance to decouplers by changes in the Qac locus. In our study, an insertion in this gene was observed only in the hospital acquired MRSA strain (MRSA252), likely a factor in its increased survival in the chosen environment (hospitals). In genes of the high variability group with a transmembrane domain, insertions / deletions are present in functionally relevant regions. A similar trend was observed in genes of the high variability group that had an insertion / deletion. Genes that had an insertion / deletion in their active region (extracellular domain) were AgrB [21], EssC [22], Leukocidin S [23], SdrH [24] and ClfB [25], all of which have reported roles in pathogenesis. Another interesting protein that had insertions / deletions was an LPXTG domain protein. This motif is present in several surface associated pathogenesis genes of this pathogen, including sortase A and SdrA proteins [26, 24, 27]. Despite high levels of intra-family variation (within the orthologous family for each gene), none of these genes exhibited a dN/dS > 1. These examples reveal the accumulation of variations in pathogenesis genes in the absence of quantifiable positive evolution (dN/dS < 1). These changes, while not changing the overall function of the protein, may alter its activity by changing key amino acid residues.

Of particular interest in the high variability group were the genes involved in nasal colonization. Clumping factors (clfA and clfB) [28, 29] and teichoic acid biosynthesis enzyme (tagX) [30] have been shown to be responsible for nasal colonization of S. aureus in humans and mice respectively. High SNP content of these genes may contribute to the high degree of variation in nasal colonization observed across various strains. The clumping factors mediate cell adherence (clfA) and nasal colonization (clfB). In both genes, variations are present throughout the reading frame with clfA being more variant than clfB. Similarly, in TagX protein (involved in nasal colonization), variations occur across the length of the protein and are not restricted to any particular region or domain. These results show that nucleotide level polymorphisms may play an important role in modulating nasal colonization and carriage of S. aureus. These results may also explain in part the lack of uniform results in the search for factors determining nasal carriage. Figure 3 illustrates the variation in several key genes involved in pathogenicity.

Figure 3
Occurrence of polymorphism in key pathogenicity genes

In summary, 23 of 189 genes (12.3%) of the high variability group were directly implicated in pathogenesis. A comparison with the low variability group (Fv <=0.01) reveals that this is a significant fraction (the fraction of pathogenesis genes in low variability group was ~0.2%; one gene out of 501). This comparison strongly suggests that pathogenesis genes in S. aureus preferentially accumulated variations at both the nucleotide and protein levels. That none of these genes evolved positively (for the pathogenesis genes, dN/dS < 1) points towards a functional conservation despite high variability. Some genes in the high variability group have already been used for sequence typing of S. aureus while others described in this study might be suitable targets for novel typing methods. Our results highlight the importance of variation in the absence of positive evolution as an important survival strategy in the pathogen, S. aureus.
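The comparison of the two fractions can be framed as a 2x2 table (pathogenesis-related versus other genes, in the high versus low variability group) and examined with a two-sided Fisher exact test, as in the sketch below. This is an illustration under stated assumptions: the text does not say that this exact table was the one tested, and the function name is hypothetical. Only the counts 23 of 189 and 1 of 501 come from the results above.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def weight(k):
        # Unnormalized probability of a table with k in the top-left cell.
        return comb(row1, k) * comb(row2, col1 - k)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    observed = weight(a)
    extreme = sum(weight(k) for k in range(lowest, highest + 1)
                  if weight(k) <= observed)
    return extreme / comb(total, col1)


# High variability group: 23 pathogenesis genes, 166 others (189 in total).
# Low variability group: 1 pathogenesis gene, 500 others (501 in total).
p_value = fisher_exact_two_sided(23, 166, 1, 500)
print(p_value < 0.05)
```

The weights are kept as exact integers so that the comparison with the observed table does not depend on floating point tolerance.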

Table 1
Distribution of COG groups in the CMG, LoVar set and set of proteins with transmembrane domains.
Table 2
List of genes in InDel group with a reported link to pathogenesis


This work was funded by National Institutes of Health USA grant (RO1-AI060753), awarded to Alexander M Cole.




1. Zhang Y, Laing C, Steele M, Ziebell K, Johnson R, Benson AK, Taboada E, Gannon VP. Genome evolution in major Escherichia coli O157:H7 lineages. BMC Genomics. 2007;8:121.
2. Hols P, Hancy F, Fontaine L, Grossiord B, Prozzi D, Leblond-Bourget N, Decaris B, Bolotin A, Delorme C, Dusko Ehrlich S, Guédon E, Monnet V, Renault P, Kleerebezem M. New insights in the molecular biology and physiology of Streptococcus thermophilus revealed by comparative genomics. FEMS Microbiol Rev. 2005;29:435–463.
3. Prudhomme M, Libante V, Claverys JP. Homologous recombination at the border: insertion-deletions and the trapping of foreign DNA in Streptococcus pneumoniae. Proc Natl Acad Sci U S A. 2002;99:2100–2105.
4. Velge P, Cloeckaert A, Barrow P. Emergence of Salmonella epidemics: the problems related to Salmonella enterica serotype Enteritidis and multiple antibiotic resistance in other major serotypes. Vet Res. 2005;36:267–288.
5. Baba T, Bae T, Schneewind O, Takeuchi F, Hiramatsu K. Genome sequence of Staphylococcus aureus strain Newman and comparative analysis of staphylococcal genomes: polymorphism and evolution of two major pathogenicity islands. J Bacteriol. 2008;190:300–310.
6. Sivaraman K, Venkataraman N, Tsai J, Dewell S, Cole AM. Genome sequencing and analysis reveals possible determinants of Staphylococcus aureus nasal carriage. BMC Genomics. 2008;9:433.
7. Herron-Olson L, Fitzgerald JR, Musser JM, Kapur V. Molecular correlates of host specialization in Staphylococcus aureus. PLoS ONE. 2007;2:e1120.
8. Rasmussen TB, Danielsen M, Valina O, Garrigues C, Johansen E, Pedersen MB. Streptococcus thermophilus core genome: comparative genome hybridization study of 47 strains. Appl Environ Microbiol. 2008;74:4703–4710.
9. Lindsay JA, Moore CE, Day NP, Peacock SJ, Witney AA, Stabler RA, Husain SE, Butcher PD, Hinds J. Microarrays reveal that each of the ten dominant lineages of Staphylococcus aureus has a unique combination of surface-associated and regulatory genes. J Bacteriol. 2006;188:669–676.
10. Ben Zakour NL, Sturdevant DE, Even S, Guinane CM, Barbey C, Alves PD, Cochet MF, Gautier M, Otto M, Fitzgerald JR, Le Loir Y. Genome-wide analysis of ruminant Staphylococcus aureus reveals diversification of the core genome. J Bacteriol. 2008;190:6302–6317.
11. Zhang W, Qi W, Albert TJ, Motiwala AS, Alland D, Hyytia-Trees EK, Ribot EM, Fields PI, Whittam TS, Swaminathan B. Probing genomic diversity and evolution of Escherichia coli O157 by single nucleotide polymorphisms. Genome Res. 2006;16:757–767.
12. Doddapaneni H, Yao J, Lin H, Walker MA, Civerolo EL. Analysis of the genome-wide variations among multiple strains of the plant pathogenic bacterium Xylella fastidiosa. BMC Genomics. 2006;7:225.
13. Kuroda M, Ohta T, Uchiyama I, Baba T, Yuzawa H, Kobayashi I, Cui L, Oguchi A, Aoki K, Nagai Y, Lian J, Ito T, Kanamori M, Matsumaru H, Maruyama A, Murakami H, Hosoyama A, Mizutani-Ui Y, Takahashi NK, Sawano T, Inoue R, Kaito C, Sekimizu K, Hirakawa H, Kuhara S, Goto S, Yabuzaki J, Kanehisa M, Yamashita A, Oshima K, Furuya K, Yoshino C, Shiba T, Hattori M, Ogasawara N, Hayashi H, Hiramatsu K. Whole genome sequencing of meticillin-resistant Staphylococcus aureus. Lancet. 2001;357:1225–1240.
14. Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, Krylov DM, Mazumder R, Mekhedov SL, Nikolskaya AN, Rao BS, Smirnov S, Sverdlov AV, Vasudevan S, Wolf YI, Yin JJ, Natale DA. The COG database: an updated version includes eukaryotes. BMC Bioinformatics. 2003;4:41.
15. Thompson JD, Gibson TJ, Higgins DG. Chapter 2, Multiple sequence alignment using ClustalW and ClustalX. Curr Protoc Bioinformatics. 2002 Unit 2.3.
16. Yang Z. PAML: a program package for phylogenetic analysis by maximum likelihood. Comput Appl Biosci. 1997;13:555–556.
17. Suyama M, Torrentas D, Bork P. PAL2NAL: robust conversion of protein sequence alignments into the corresponding codon alignments. Nucleic Acids Res. 2006;34:W609–W612.
18. Nei M, Gojobori T. Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol Biol Evol. 1986;3:418–426.
19. Sonnhammer EL, von Heijne G, Krogh A. A hidden Markov model for predicting transmembrane helices in protein sequences. Proc Int Conf Intell Syst Mol Biol. 1998;6:175–182.
20. Eddy SR. Hidden Markov models. Curr Opin Struct Biol. 1996;6:361–365.
21. Ji G, Pei W, Zhang L, Qiu R, Lin J, Benito Y, Lina G, Novick RP. Staphylococcus intermedius produces a functional agr autoinducing peptide containing a cyclic lactone. J Bacteriol. 2005;187:3139–3150.
22. Burts ML, DeDent AC, Missiakas DM. EsaC substrate for the ESAT-6 secretion pathway and its role in persistent infections of Staphylococcus aureus. Mol Microbiol. 2008;69:736–746.
23. Monecke S, Kuhnert P, Hotzel H, Slickers P, Ehricht R. Microarray based study on virulence-associated genes and resistance determinants of Staphylococcus aureus isolates from cattle. Vet Microbiol. 2007;125:128–140.
24. McCrea KW, Hartford O, Davis S, Eidhin DN, Lina G, Speziale P, Foster TJ, Höök M. The serine-aspartate repeat (Sdr) protein family in Staphylococcus epidermidis. Microbiology. 2000;146(Pt 7):1535–1546.
25. DeDent A, Bae T, Missiakas DM, Schneewind O. Signal peptides direct surface proteins to two distinct envelope locations of Staphylococcus aureus. EMBO J. 2008;27:2656–2668.
26. DeDent AC, McAdow M, Schneewind O. Distribution of protein A on the surface of Staphylococcus aureus. J Bacteriol. 2007;189:4473–4484.
27. O'Neill E, Pozzi C, Houston P, Humphreys H, Robinson DA, Loughman A, Foster TJ, O'Gara JP. A novel Staphylococcus aureus biofilm phenotype mediated by the fibronectin-binding proteins, FnBPA and FnBPB. J Bacteriol. 2008;190:3835–3850.
28. Ganesh VK, Rivera JJ, Smeds E, Ko YP, Bowden MG, Wann ER, Gurusiddappa S, Fitzgerald JR, Höök M. A structural model of the Staphylococcus aureus ClfA-fibrinogen interaction opens new avenues for the design of anti-staphylococcal therapeutics. PLoS Pathog. 2008;4:e1000226.
29. Wertheim HF, Walsh E, Choudhurry R, Melles DC, Boelens HA, Miajlovic H, Verbrugh HA, Foster T, van Belkum A. Key role for clumping factor B in Staphylococcus aureus nasal colonization of humans. PLoS Med. 2008;5:e17.
30. Weidenmaier C, Kokai-Kun JF, Kristian SA, Chanturiya T, Kalbacher H, Gross M, Nicholson G, Neumeister B, Mond JJ, Peschel A. Role of teichoic acids in Staphylococcus aureus nasal colonization, a major risk factor in nosocomial infections. Nat Med. 2004;10:243–245.