Staphylococcus aureus is a versatile pathogen that shows high levels of inter-strain genetic variability and positive evolution in certain pathogenesis-related genes. Apart from gene content differences, variability in shared genes may affect pathogenicity. Studying such variability requires that the common minimal genome (CMG) be identified. In this study we have surveyed the CMG of S. aureus with respect to variability amongst orthologous family members, and determined that genes involved in pathogenesis preferentially accumulate variations. A negative correlation between variability of genes and their evolution was found, suggesting a preservation of host-specific function while exhibiting sequence diversity. Variation in key pathogenesis genes in S. aureus might predispose them to functional modulation, thereby playing an important role in evasion of host immunity.
Genome sequencing and analysis afford an opportunity to study the fundamental properties and changes that distinguish bacteria of different species or strains. In many bacteria, it has been shown that strain-specific gene content differences lead to pathogenesis (Escherichia coli O157:H7, Streptococcus spp. [2, 3]), host specialization (Salmonella enterica serovar Typhi), drug resistance (Staphylococcus aureus VRSA), and modulation of host immune responses (S. aureus D30 and 930918-3). In S. aureus, the availability of several genomes (14 human-specific strains and 1 bovine pathogenic strain) has enabled the study of genomic variation at both the gene level and the nucleotide level.
To study gene-level variation in a given species, it is necessary to identify a set of genes common to all strains. This set, known as the common minimal genome (CMG) or the core genome, defines the set of biological processes that are shared by all strains of the species, and hence the species itself. On the other hand, genes borne on mobile genetic elements like phages or transposons constitute the variable genome. The variable genome of S. aureus encodes genes for modulation of host immune responses and for certain phenomena such as nasal carriage. However, major reported determinants of staphylococcal pathogenicity like the accessory gene regulator (Agr), clumping factors (clfA and clfB) and exotoxins are present in all S. aureus strains and thus are a part of the CMG, suggesting a fundamental role for these genes in pathogenicity.
Even though differences in gene content explain major differences observed between strains, simple nucleotide-level events like mutations, insertions, deletions, truncations and single nucleotide polymorphisms also play an important role in defining and modulating pathogenicity. In the bovine pathogenic strain of S. aureus ET3-S, positive evolution (dN/dS > 1) has been shown in several key pathogenesis genes, and it has been argued that these genes have specifically evolved to adapt to a different host, the cow [10, 7]. In other pathogens like E. coli and Xylella fastidiosa, positive evolution of genes has been attributed to adaptation to host type as well as enhancement in pathogenicity.
Positive evolution of genes has been measured using the Yang and Nielsen metric (dN/dS ratio) under the assumption that non-synonymous changes would lead to differences in functionality. It is worth noting that there can be a high degree of variation that does not manifest itself as a dN/dS ratio > 1. While positive evolution of genes and its effect on pathogenicity have been documented for several organisms including S. aureus, the role of gene-level variations that do not lead to evolution of the species has been neglected. In this study, we term this phenomenon "variation in the absence of evolution" and analyze the effect of variations in the CMG of S. aureus on the pathogenicity of the organism. We first defined the CMG of S. aureus based on all 14 completely sequenced human-specific strains available at the time of this study, and demonstrated that variation and evolution are not positively correlated in these strains. Interestingly, we determined that pathogenesis-related genes in S. aureus preferentially accumulated variations in the absence of positive evolution (dN/dS < 1). Such variations, some of which occur in key amino acid residues of host adhesion molecules (clumping factors), drug transporters (QacC) and regulatory proteins (MarR family protein), likely affect their activity. We hypothesize that such variation in pathogenesis-related genes enables the pathogen to modulate infectivity in response to the host.
All sequences were obtained from the NCBI sequence database (ftp://ftp.ncbi.nih.gov/genomes/Bacteria). The strains used in this study are N315 (NC_002745), Mu50 (NC_002758), COL (NC_002951), MRSA252 (NC_002952), MSSA476 (NC_002953), MW2 (NC_003923), RF122 (NC_007622), USA300 (NC_007793), NCTC8325 (NC_007795), JH1 (NC_009632), JH9 (NC_009487), Mu3 (NC_009782) and USA300_TCH1516 (NC_010079). Available annotations for each genome were used, in conjunction with the clusters of orthologous groups (COG) classification and manual curation for functional analysis. When multiple annotations were present, the most descriptive annotation was manually chosen to represent the function. It should be noted that COG does not have a family specific for pathogenicity. Hence, annotations for genes involved in pathogenicity were derived from published material for each such gene.
We used bidirectional best-hit BLAST (gapped BLAST) to identify orthologs. Since we were searching for a common group of orthologs in all the strains, the choice of a reference strain was arbitrary. We used the first sequenced strain, N315, as the reference. All the genes in N315 that had orthologs in the other 13 human-specific strains were considered a part of the common minimal genome (CMG). Protein sequences were obtained by translation of gene sequences. Multiple alignments were performed on each gene / protein family (containing 14 sequences each) using CLUSTALW.
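The ortholog search described above can be sketched as follows. This is a minimal illustration only, assuming that the best BLAST hit of every gene has already been parsed into dictionaries; the function names and the toy gene identifiers are hypothetical and are not taken from the scripts used in this study.

```python
def reciprocal_best_hits(ref_to_other, other_to_ref):
    """Return {ref_gene: other_gene} pairs that are each other's best hit."""
    return {
        ref_gene: other_gene
        for ref_gene, other_gene in ref_to_other.items()
        if other_to_ref.get(other_gene) == ref_gene
    }


def common_minimal_genome(ref_genes, best_hits_by_strain):
    """Keep reference genes with a bidirectional best hit in every strain.

    best_hits_by_strain maps a strain name to a pair of dictionaries:
    (best hits reference -> strain, best hits strain -> reference).
    """
    core = set(ref_genes)
    for ref_to_other, other_to_ref in best_hits_by_strain.values():
        core &= set(reciprocal_best_hits(ref_to_other, other_to_ref))
    return core


# Toy data (hypothetical identifiers): three reference genes, two other strains.
ref_genes = ["geneA", "geneB", "geneC"]
best_hits = {
    "strain1": (
        {"geneA": "s1_01", "geneB": "s1_02", "geneC": "s1_03"},
        {"s1_01": "geneA", "s1_02": "geneB", "s1_03": "geneC"},
    ),
    "strain2": (
        {"geneA": "s2_01", "geneB": "s2_02"},  # geneC has no hit in strain2
        {"s2_01": "geneA", "s2_02": "geneX"},  # geneB's hit is not reciprocal
    ),
}
cmg = common_minimal_genome(ref_genes, best_hits)
```

In this toy example only geneA survives, because geneB fails the reciprocity check and geneC has no hit in the second strain. With a fixed reference strain, a gene enters the CMG only if it passes the check against every other strain.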
In order to study the evolution of genes, we used the PAML package. Conversion of protein alignments to DNA alignments for use in PAML was performed using pal2nal. These alignment files were then used to calculate the dN/dS ratio using the Nei-Gojobori method. Apart from calculating the dN/dS ratio, we calculated the fraction of gaps and mismatches from the CLUSTAL alignments. The fraction was calculated as the ratio of the number of gaps and mismatches in the alignment to the overall length of the alignment.
We used the TMHMM software for calculating transmembrane regions in the proteins. For the set of genes with truncations, we predicted functional domains by searching against Pfam using hmmpfam. All predictions and alignments were parsed using Perl scripts (supplementary methods).
We used the COG classification system for this study. The COG system classifies genes according to their function, in a hierarchical manner. There are four major groups of COGs consisting of a total of 25 specific subgroups. The major groups are: 1) information storage and processing, 2) cellular functions, 3) metabolism, and 4) poorly characterized functions. Information storage and processing (Inf in this study) comprises genes involved in replication, transcription and translation along with those for chromatin structure and RNA processing. Cellular functions (Cel in this study) encompass genes involved in cell division, membrane biosynthesis, extracellular structures, signal transduction and protein processing. The group metabolism (Met in this study) includes genes involved in all metabolic activities as well as genes that encode metabolic transporters. Finally, those genes with a generic assigned function or no function at all were classified as poorly characterized (Hypothetical; Hyp in this study).
For each orthologous gene family, we calculated the fraction of variation (Fv) as follows: Fv = (number of mismatches or gaps in the alignment) / (length of the alignment). We chose the top 10% of genes ranked by variation (Fv >= 0.11) and named them the high variability group. In the high variability group, we studied the effects of both truncations and insertion / deletion events on variability. To compare the gene content of the high variability group, we used a set of genes with low variability (Fv <= 0.01).
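The Fv calculation can be illustrated with a short sketch. It assumes that an alignment column counts as variable when it contains a gap character or more than one distinct residue; this column-wise reading of "mismatches or gaps" is an assumption, as are the function names and the toy alignment. The cutoffs 0.11 and 0.01 are the ones stated above.

```python
def fraction_of_variation(aligned_seqs):
    """Fv = (columns containing a gap or a mismatch) / alignment length."""
    length = len(aligned_seqs[0])
    if length == 0 or any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("sequences must share the same non-zero aligned length")
    variable_columns = sum(
        1
        for column in zip(*aligned_seqs)
        if "-" in column or len(set(column)) > 1
    )
    return variable_columns / length


def variability_group(fv, high_cutoff=0.11, low_cutoff=0.01):
    """Assign a gene family to the high, low or intermediate group by Fv."""
    if fv >= high_cutoff:
        return "high"
    if fv <= low_cutoff:
        return "low"
    return "intermediate"


# Toy alignment of three sequences: one mismatch column and one gap column.
alignment = [
    "MKTLLVAGGA",
    "MKTLLVAGGA",
    "MKSLLVA-GA",
]
fv = fraction_of_variation(alignment)
group = variability_group(fv)
```

Here two of the ten columns are variable, so the family falls in the high variability group.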
We compared the distribution of COG families in each of the groups outlined above, using the chi-square test and the Fisher test. A P-value of 0.05 or less in the Fisher test was considered significant. We studied the correlation between dN/dS values and Fv values using Pearson’s correlation coefficient (PCC). For all these tests, we used the R statistical package (www.r-project.org).
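As an illustration of the correlation analysis, Pearson's correlation coefficient can be computed as in the sketch below. The study used R; this pure-Python version is a stand-in written only for illustration, and the Fv and dN/dS values are invented toy numbers, not data from this study.

```python
import math


def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return covariance / math.sqrt(var_x * var_y)


# Toy values (hypothetical): one Fv and one dN/dS value per gene family.
fv_values = [0.01, 0.03, 0.05, 0.12, 0.20]
dnds_values = [0.30, 0.25, 0.28, 0.15, 0.10]
pcc = pearson(fv_values, dnds_values)
```

A negative value of pcc means that families with higher Fv tend to have lower dN/dS, which is the direction of the relationship examined in this study.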
The common minimal genome (CMG) of S. aureus was derived from 14 human-specific strains and consisted of 1888 genes (supplementary information 1). The COG distribution of the CMG reveals a large number of genes involved in metabolic activities, followed by cellular functions and hypothetical genes, in that order (Figure 1). In this study we analyzed the correlation between variation in the CMG and pathogenicity. In the following sections, we compared and contrasted the groups of genes showing high variability with those that show little or no variability.
Positive evolution (dN/dS > 1) of pathogenesis-related genes has been documented in bovine mastitis strains of S. aureus but not in the human isolates. In fact, our analysis of the CMG revealed that none of the known pathogenesis-related genes had a dN/dS value > 1. However, we observed high levels of variability in many genes in the CMG. A scatter plot of Fv values versus dN/dS values revealed a weak negative correlation (PCC = −0.2138) between these two values for a given gene and its protein (Figure 2). This correlation provides evidence for variation of genes in the absence of positive evolution, reflecting conservation of function with an altered / modulated activity for the gene product. In order to characterize genes exhibiting high variability in the absence of positive evolution, we studied the top 10% of variable genes in the CMG. The choice of the top 10% was statistically justified, since the Fv cutoff value (Fv >= 0.11) was higher than Q3 + 1.5 × IQR (the third quartile plus 1.5 times the interquartile range) of the distribution (data not shown). Apart from the overall functional composition of genes in the high variability group, we were also interested in identifying the processes that generate the variations. The high variability group consisted of 189 genes (top 10% of the CMG). A majority of them (117 genes) had truncations; in 23 of these, the truncation extended into the protein's predicted functional domain (Pfam prediction), producing putative pseudogenes. However, only two of these 23 genes (RbsU and MarR family regulatory protein) were associated with pathogenicity.
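The outlier criterion used to justify the cutoff (Q3 + 1.5 × IQR) can be sketched as follows. The quartile convention (the "inclusive" method of Python's statistics module) is an assumption, since the convention used in the original analysis is not stated, and the Fv values are invented toy numbers.

```python
import statistics


def upper_fence(values):
    """Return Q3 + 1.5 * IQR, the usual upper limit for non-outlying values."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 + 1.5 * (q3 - q1)


# Toy Fv distribution (hypothetical): mostly low values and one extreme value.
toy_fv = [0.00, 0.01, 0.01, 0.02, 0.02, 0.03, 0.03, 0.04, 0.20]
fence = upper_fence(toy_fv)
outliers = [fv for fv in toy_fv if fv > fence]
```

A cutoff that lies above this fence selects only families whose variability is unusually high relative to the bulk of the distribution.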
There were 36 genes that had either an insertion or a deletion and variations that accounted for more than 11% of the alignment length. The insertions and deletions in each gene family were contributed by different strains, eliminating the possibility that the variation arose from one outlier strain (data not shown). The composition of genes with insertions and deletions was significantly different from the overall CMG composition. This sub-group also had a high fraction of pathogenesis-associated genes. Notable examples of pathogenesis-related genes included superantigen-like proteins (exotoxins), accessory gene regulatory protein B (AgrB), clumping factor A, and the virulence protein EssC. These genes are central to staphylococcal pathogenicity and aid in establishing (superantigen-like proteins and EssC) and regulating (AgrB) pathogenicity.
Even outside the high variability group, insertions and deletions in certain genes were apparently biologically significant. A case in point is the QacA gene, which is involved in resistance to decouplers (phenolics) used extensively in the hospital environment. A previous study has shown that hospital-acquired MRSA have evolved strong resistance to decouplers by changes in the Qac locus. In our study, an insertion in this gene was observed only in the hospital-acquired MRSA strain (MRSA252), likely a factor in its increased survival in the hospital environment. In genes of the high variability group with a transmembrane domain, insertions / deletions are present in functionally relevant regions. A similar trend was observed in genes of the high variability group that had an insertion / deletion. Genes that had an insertion / deletion in their active region (extracellular domain) were AgrB, EssC, Leukocidin S, SdrH and ClfB, all of which have reported roles in pathogenesis. Another interesting protein that had insertions / deletions was an LPXTG domain protein. This motif is present in several surface-associated pathogenesis genes of this pathogen, including sortase A and SdrA proteins [26, 24, 27]. Despite high levels of intra-family variation (within the orthologous family for each gene), none of these genes exhibited a dN/dS > 1. These examples reveal the accumulation of variations in pathogenesis genes in the absence of quantifiable positive evolution (dN/dS < 1). These changes, while not altering the overall function of the protein, may alter its activity by changing key amino acid residues.
Of particular interest in the high variability group were the genes involved in nasal colonization. Clumping factors (clfA and clfB) [28, 29] and the teichoic acid biosynthesis enzyme (tagX) have been shown to be responsible for nasal colonization of S. aureus in humans and mice, respectively. The high SNP content of these genes may contribute to the high degree of variation in nasal colonization observed across various strains. The clumping factors mediate cell adherence (clfA) and nasal colonization (clfB). In both genes, variations are present throughout the reading frame, with clfA being more variable than clfB. Similarly, in the TagX protein (involved in nasal colonization), variations occur across the length of the protein and are not restricted to any particular region or domain. These results show that nucleotide-level polymorphisms may play an important role in modulating nasal colonization and carriage of S. aureus. These results may also explain in part the lack of uniform results in the search for factors determining nasal carriage. Figure 3 illustrates the variation in several key genes involved in pathogenicity.
In summary, 23 of 189 genes (12.2%) of the high variability group were directly implicated in pathogenesis. A comparison with the low variability group (Fv <= 0.01) reveals that this is a significant fraction (the fraction of pathogenesis genes in the low variability group was ~0.2%; one gene out of 501). This comparison strongly suggests that pathogenesis genes in S. aureus preferentially accumulated variations at both the nucleotide and protein levels. That none of these genes evolved positively (for the pathogenesis genes, dN/dS < 1) points towards functional conservation despite high variability. Some genes in the high variability group have already been used for sequence typing of S. aureus, while others described in this study might be suitable targets for novel typing methods. Our results highlight variation in the absence of positive evolution as an important survival strategy in the pathogen S. aureus.
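The comparison of the two groups can be illustrated with a Fisher exact test on the counts reported above (23 of 189 genes in the high variability group versus 1 of 501 in the low variability group). The study used R for its tests; the pure-Python function below is a stand-in written for illustration, and its name and the two-sided convention (summing all tables no more probable than the observed one) are assumptions.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, total = a + b, a + c, a + b + c + d
    denominator = comb(total, row1)

    def probability(k):
        # Hypergeometric probability of k in the top-left cell, margins fixed.
        return comb(col1, k) * comb(total - col1, row1 - k) / denominator

    lowest = max(0, row1 - (total - col1))
    highest = min(row1, col1)
    p_observed = probability(a)
    # Sum every table that is no more probable than the observed one.
    return sum(
        p
        for p in (probability(k) for k in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )


# Rows: high and low variability groups.
# Columns: pathogenesis genes and all other genes.
p_value = fisher_exact_two_sided(23, 189 - 23, 1, 501 - 1)
is_significant = p_value <= 0.05
```

With these counts the p-value falls far below the 0.05 threshold used in this study, consistent with the conclusion that pathogenesis genes are over-represented in the high variability group.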
This work was funded by National Institutes of Health USA grant (RO1-AI060753), awarded to Alexander M Cole.