A total of 1,042 sequence reads were obtained from the four PSSH libraries. Filtering out sequences that matched the NCTC 8325 genome with at least 90% identity yielded 427 reads of interest, suggesting that under these conditions PSSH has an efficiency of 41%. This is comparable to the efficiency previously reported for single strain SSH. Examination of these 427 reads revealed 16 sequences (3.7% of the reads of interest) that did not match the non-redundant nucleotide database at our cutoff levels. The breakdown of reads of interest by pool and type is given in Table 2. The majority of these reads of interest had no homology to NCTC 8325 across their entire length. However, 55 sequences (12.9%) matched NCTC 8325 but were at least 10% divergent. Taken together, these 427 reads represented 190,943 bp of sequence. Each published S. aureus genome had some sequence homologous to at least one read in these pools.
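The percentages quoted above follow directly from the read counts. The short Python sketch below recomputes them as a sanity check; the variable and function names are illustrative only and are not part of the original analysis.

```python
# Read counts as reported in the text; names are illustrative assumptions.
TOTAL_READS = 1042            # reads obtained from the four PSSH libraries
READS_OF_INTEREST = 427       # reads kept after filtering against NCTC 8325
NO_NR_MATCH = 16              # reads with no match in the non-redundant database
DIVERGENT_FROM_DRIVER = 55    # reads matching NCTC 8325 but >= 10% divergent
TOTAL_BP = 190943             # total sequence represented by the reads of interest


def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


efficiency = percent(READS_OF_INTEREST, TOTAL_READS)
novel_fraction = percent(NO_NR_MATCH, READS_OF_INTEREST)
divergent_fraction = percent(DIVERGENT_FROM_DRIVER, READS_OF_INTEREST)
mean_read_length = TOTAL_BP / READS_OF_INTEREST

print(f"PSSH efficiency: {efficiency}%")
print(f"Reads with no nr match: {novel_fraction}%")
print(f"Divergent reads: {divergent_fraction}%")
print(f"Mean length of reads of interest: {mean_read_length:.1f} bp")
```

The mean read length is not stated in the text; it is derived here only from the two reported totals.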
When developing the PSSH approach, we first compared the efficiency of our PSSH assay with that of single strain SSH. We examined gene discovery efficiency using pool sizes ranging from 2 to 8 isolates per pool. There was no significant difference in efficiency of even the 8 or 10 strain pool when compared to single strain SSH (data not shown). Similar efficiency was previously reported for SSH experiments on Helicobacter pylori (Agron et al., 2002), Staphylococcus xylosus (Dordet-Frisoni et al., 2007) and Streptococcus mutans (Guo et al., 2006). This suggests PSSH is a powerful technique able to rapidly screen large libraries for novel ORFs or those potentially associated with virulence (or unique phenotypes).
We determined that PSSH is effective at detecting a wide range of genetic polymorphisms. These include multi-gene operons, smaller insertions of one open reading frame or less, homologs or orthologs that show some sequence diversity between comparable open reading frames, and extrachromosomal elements such as plasmids, which may be a factor in pathogenesis. One example of the power of this technique is the detection of SAR0158, SAR0159, SAR0160 and SAR0161 across three reads in the two Clonal Complex 30 pools. These genes (cap8HIJK) are members of a cluster of 16 open reading frames that is co-transcribed and involved in type 8 capsule biosynthesis. In further agreement with our results, cap8HIJK has been reported as not detectable in NCTC 8325 by hybridization (Sau et al., 1997). It is not surprising that this operon is found in our pools, as 50% of clinical isolates are capsule type 8 by serology (O'Riordan and Lee, 2004). By BLAST search (Altschul et al., 1997), cap8HIJK is found in four S. aureus genomes (MRSA252, MW2, RF122 and MSSA476).
PSSH also allows for the detection of reads that have detectable homology to the driver strain but are significantly divergent. We considered a fragment unique relative to NCTC 8325 if it was either not found in that genome or had 90% or less identity to the corresponding sequence. Even with an average read accuracy of 99.4% (Margulies et al., 2005), Sanger sequencing results in occasional base call errors. Therefore, it is important to set an identity cutoff that is stringent enough to prevent false positives from being entered into a library of unique fragments, but loose enough to allow detection of divergent sequence. Our use of 90% identity appeared to satisfy both conditions. As an example, a 427 nt open reading frame, SAR2564, annotated to encode a putative membrane protein, was detected in the Clonal Complex 30 endocarditis pool. The detection of SAR2564 highlights the ability of PSSH to detect smaller polymorphisms. This locus has no homology to the chromosome of NCTC 8325 in its final 118 nt. The flanking ORFs, SAR2563 and SAR2565, are conserved on the NCTC 8325 genome with 92% and 90% identity, respectively.
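The cutoff rule described above can be written as a small decision function. The following Python sketch is a minimal illustration, assuming each read has already been reduced to its best percent identity against the driver genome (or to no hit at all). The function names, category labels and example reads are hypothetical and do not reproduce the actual analysis pipeline.

```python
from typing import Optional

# Percent identity to the driver genome (NCTC 8325) at or below which a
# read is treated as unique, following the 90% cutoff stated in the text.
IDENTITY_CUTOFF = 90.0


def classify_read(best_driver_identity: Optional[float],
                  cutoff: float = IDENTITY_CUTOFF) -> str:
    """Classify a read by its best hit against the driver genome.

    best_driver_identity is None when no hit was found; otherwise it is
    the percent identity of the best alignment (an assumed input format).
    """
    if best_driver_identity is None:
        return "absent"        # not found in the driver genome
    if best_driver_identity <= cutoff:
        return "divergent"     # homologous but at least 10% divergent
    return "shared"            # too similar to the driver; filtered out


def is_read_of_interest(best_driver_identity: Optional[float],
                        cutoff: float = IDENTITY_CUTOFF) -> bool:
    """Return True for reads kept in the library of unique fragments."""
    return classify_read(best_driver_identity, cutoff) != "shared"


# Hypothetical reads, for illustration only.
example_reads = {"read_A": None, "read_B": 87.0, "read_C": 99.4}
for name, identity in example_reads.items():
    print(name, classify_read(identity))
```

Raising the cutoff would admit more reads whose differences could be base call errors, while lowering it would discard moderately divergent homologs.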
PSSH is also able to detect sequence that diverges sharply from a well conserved ORF. Clone 1F05 (homologous to SAR2779 of MRSA252, an unstudied putative N-acetyltransferase) from the Clonal Complex 30 osteomyelitis pool had some 85 to 86% identity to a homologue on every published S. aureus genome; however, it matched only MRSA252 with 100% identity. SAR2779 is strikingly different from its homologs, displaying 13% divergence in identity across its entire 801 bp compared to its counterpart on each S. aureus genome contained in GenBank. This suggests that phenotypic differences between distantly related clonal complexes may be due to the slow accumulation of point mutations over time, in addition to the sudden uptake of horizontally transferred genes.
We also detected plasmid-like sequence in at least 60 reads of interest (14.1%). Given the multi-copy nature of plasmids, we were initially concerned that our libraries might be saturated with extrachromosomal elements. However, it appears that PSSH is effective in removing extreme imbalances in copy number and ensuring that no unique fragment is grossly overrepresented.
Analysis of sequence obtained by PSSH also provides insights into horizontal gene transfer between distant genera. For example, clone 1C08 from the Clonal Complex 30 endocarditis pool was homologous to SAR0720, an unstudied putative cation exporting ATPase protein. Matching sequence was not found in any genome other than S. aureus MRSA252 and Macrococcus caseolyticus JCSC5402 (Identity = 93%, expect = 4e−154, 100% query coverage). The predicted amino acid sequence of SAR0720 matched MCCL_0243 (Identity = 96%, expect = 0.0, 100% query coverage), a putative M. caseolyticus JCSC5402 cation-transporting ATPase. Macrococcus is believed to be an ancestor of S. aureus, possibly donating the methicillin resistance complex to create MRSA (Baba et al., 2009). Another example is the detection of SAR0261, a putative nitric oxide reductase that is found in only one published S. aureus genome, MRSA252 (Holden et al., 2004). This has significant predicted protein homology to the nitric oxide reductases of many microbes, among them the norB of Neisseria meningitidis (Householder et al., 2000; Rock et al., 2007) (e value = 3e−117), and that of the Gram positive dental pathogen Lactobacillus fermentum (e value = 0). These results demonstrate the power of PSSH to efficiently detect horizontal gene transfer and to identify environmental donors of virulence factors.
Several additional trends were noticed. Reads obtained from Clonal Complex 5 were more likely to be novel sequence not found in the non-redundant nucleotide database (7.8%) than reads from Clonal Complex 30 (0.4%). These results suggest that the genomic content of S. aureus strains in this collection is divergent and that similarities are more likely to be found based on Clonal Complex than on infection site. If a Clonal Complex 30 sequence matched only a single S. aureus genome, it was likely to be MRSA252, possibly because MRSA252 is the only published Clonal Complex 30 genome available. There is yet to be a published representative genome for Clonal Complex 5 S. aureus. Had one been available, we suspect that the number of novel sequences detected in Clonal Complex 5 would have decreased. Clonal Complex 5 also had a high level of plasmid content compared to Clonal Complex 30. We also observed that, while some reads overlapped the same open reading frame, we did not see the significant level of repetition that would be expected if our sequencing power had saturated the PSSH libraries. Therefore, we suspect that there are other ORFs in these libraries, associated with endocarditis, osteomyelitis and/or Clonal Complexes 5 and 30, that were not detected due to our limited data set.
Unlike other hybridization based methods that rely on a solid support matrix and/or foreknowledge of target genes (Gerrish et al., 2007; Herron-Olson et al., 2007), PSSH allows the user to detect previously unknown sequences without the time and expense of whole genome sequencing. PSSH allows the investigator to rapidly probe the genomes of numerous clinical isolates to determine which fragments are associated with a given phenotype, genetic background, or clinical outcome. We utilized PSSH to create enriched libraries of DNA fragments found in pools of ten strains but not found in a less virulent strain. This is the first description of PSSH and its first use to study the pangenome of Clonal Complex 5 and 30 clinical isolates.
In the study of bacterial genomes, SSH has mainly been utilized for the detection of differences between two genomes. An SSH approach could have been applied to identify unique sequence fragments found in our collection of S. aureus clinical isolates, but it would have been much more expensive and less efficient. A PSSH approach allows the investigator to probe large pools of strains for potential targets related to a phenotype, and then later tie these factors to individual strains. This methodology not only allows the researcher to sample entire populations present in a pangenome for novel factors contributing to a phenotype of interest but also confers significant economic savings. As of this writing, the most popular commercially available SSH kit, the Clontech PCR-Select™ Bacterial Genome Subtraction Kit, has a per reaction cost of approximately $130 plus traditional Sanger sequencing costs. Utilizing SSH to analyze an entire microbial pangenome would quickly become prohibitively expensive and consume hours of labor with highly repetitive tasks. Data analysis would be complicated by the detection of similar unique fragments across many isolates in the pangenome. In contrast, PSSH permits significant time and cost savings by analyzing numerous representatives from a given pangenome in parallel with the same efficiency and reliability as found in single strain SSH. Unique fragments are contained in the same library and can later be tied back to individual strains by PCR. Extremes in copy number due to plasmids or phage are reduced, and relatively rare chromosomal polymorphisms can be detected with regularity.
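The scale of the saving in kit costs can be illustrated with a rough calculation. The Python sketch below assumes one subtraction reaction per strain for single strain SSH and one reaction per pool for PSSH. It uses the approximate $130 per reaction quoted above and ignores Sanger sequencing and follow-up PCR costs. The example of 40 isolates in pools of ten is an assumption based on the four libraries and ten-strain pools mentioned in the text; the function names are hypothetical.

```python
import math

# Approximate kit cost per subtraction reaction, as quoted in the text.
COST_PER_REACTION_USD = 130


def reactions_needed(n_isolates: int, pool_size: int = 1) -> int:
    """Number of subtraction reactions, assuming one reaction per pool."""
    return math.ceil(n_isolates / pool_size)


def subtraction_cost(n_isolates: int, pool_size: int = 1,
                     cost_per_reaction: int = COST_PER_REACTION_USD) -> int:
    """Kit cost only; sequencing and follow-up PCR are deliberately excluded."""
    return reactions_needed(n_isolates, pool_size) * cost_per_reaction


# Assumed example: 40 isolates, analyzed singly or in pools of ten.
n_isolates = 40
single_strain_cost = subtraction_cost(n_isolates, pool_size=1)
pooled_cost = subtraction_cost(n_isolates, pool_size=10)

print(f"Single strain SSH: {reactions_needed(n_isolates)} reactions, ${single_strain_cost}")
print(f"PSSH (pools of 10): {reactions_needed(n_isolates, 10)} reactions, ${pooled_cost}")
```

Under these assumptions the kit cost scales with the number of pools rather than the number of isolates; the true saving also depends on sequencing depth per library, which this sketch does not model.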
Staphylococcus aureus is the causative agent of a diverse group of ailments. The creation of a library of previously unstudied factors associated with discrete types of illness would be an initial step in understanding pathogenesis and proposing new treatment strategies. The strategy discussed in this communication produces targets for further study of the molecular basis of S. aureus disease. These results may enhance understanding of which bacterial factors are potentially responsible for pathogenesis and clinical outcome. PSSH may also be useful beyond the study of S. aureus pathogenesis. We propose the use of PSSH for the pangenomic analysis of any bacterial species.