Due to the possibility of a biothreat attack on civilian or military installations, a need exists for technologies that can detect and accurately identify pathogens in a near-real-time approach. One technology potentially capable of meeting these needs is a high-throughput mass spectrometry (MS)-based proteomic approach. This approach utilizes the knowledge of amino acid sequences of peptides derived from the proteolysis of proteins as a basis for reliable bacterial identification. To evaluate this approach, the tryptic digest peptides generated from double-blind biological samples containing either a single bacterium or a mixture of bacteria were analyzed using liquid chromatography-tandem mass spectrometry. Bioinformatic tools that provide bacterial classification were used to evaluate the proteomic approach. Results showed that bacteria in all of the double-blind samples were accurately identified with no false-positive assignment. The MS proteomic approach showed strain-level discrimination for the various bacteria employed. The approach also characterized double-blind bacterial samples to the respective genus, species, and strain levels when the experimental organism was not in the database due to its genome not having been sequenced. One experimental sample did not have its genome sequenced, and the peptide experimental record was added to the virtual bacterial proteome database. A replicate analysis identified the sample to the peptide experimental record stored in the database. The MS proteomic approach proved capable of identifying and classifying organisms within a microbial mixture.
The detection and accurate identification of pathogens of biological origin are of great importance to the armed forces and civilian sectors. Achieving these tasks is vital in the response to manmade or natural biothreat attacks in a proper and efficient manner to minimize the outbreak of epidemic cases. Several approaches reported in the literature have addressed the detection and identification of microorganisms based on the characterization of metabolites (1, 17) and genomic contents of bacterial cells (16). In these studies, the genomic sequence similarities generated from PCR were used to group bacteria at the genus/species level (27). Prior knowledge of the sample, or the targeting of one or a group of biological substances, is required in PCR techniques for proper primer utilization. However, proteins constitute greater than 60% of the dry weight of microorganism cellular components (4, 8, 12, 13, 22) and could provide in-depth information for the bacterial differentiation of species and their strains. Moreover, advancements in mass spectrometry (MS) ionization, detection methods, and data processing make MS a suitable analytical technique for the differentiation of microorganisms (5-7).
Using MS techniques for bacterial differentiation relies on the comparison of the proteomic information generated from the analysis of either intact protein profiles (top down) or the product ion mass spectra of digested peptide sequences (bottom up) (24, 26). For top-down analysis, bacterial differentiation is accomplished through the comparison of the MS data of intact proteins to those of an experimental mass spectral database containing the mass spectral fingerprints of the studied microorganisms (6, 7). Conversely, bacterial differentiation using the product ion mass spectral data of digested peptide sequences is accomplished through the utilization of search engines for publicly available sequence databases to infer identification (25, 29). Several peptide-searching algorithms (e.g., SEQUEST and MASCOT) have been developed to address peptide identification using proteomics databases that were generated from either fully or partially genome-sequenced organisms (6, 11, 19). Thus, our approach is based on a cross-correlation between the generated product ion mass spectra of tryptic peptides and their corresponding bacterial proteins resident in an in-house comprehensive proteome database from online databases of the sequences of microorganism genomes (30).
Recent developments in the microbial differentiation field have focused on improving the selectivity of MS data processing. The product ion mass spectrum-SEQUEST approach was reported for the identification of specific bacteria using a custom-made, limited database of sequences (14, 23). Another approach used open reading frame (ORF) translator programs to predict possible protein sequences from all probable ORFs and correlate them with the genomic sequences to establish an identification of microorganisms (5). This approach did not show advantages over the product ion mass spectrum method with regard to strain level discrimination (28). However, a recent advancement in proteomic approaches to bacterial differentiation reported a hybrid approach combining protein profiling and sequence database searching using accurate mass tags (15, 18). This approach was used to probe defined mixtures of bacteria to evaluate its capabilities.
Alternatively, our approach is based on a cross-correlation between the product ion spectra of the tryptic peptides and their corresponding bacterial proteins derived from an in-house comprehensive proteome database from genome-sequenced microorganisms (9, 10). The exploitation of this proteome database approach allowed for a faster search of the product ion spectra than that using genomic database searching. Also, it eliminates inconsistencies observed in publicly available protein databases due to the utilization of nonstandardized gene-finding programs during the process of constructing the proteome database. The proposed approach uses an ensemble of bioinformatic tools for the classification and potential identification of bacteria based on the peptide sequence information. This information is generated from the liquid chromatography-tandem mass spectrometry (LC-MS-MS) analysis of tryptic digests of bacterial protein extracts and the subsequent profiling of the sequenced peptides to create a matrix of sequence-to-bacterium (STB) assignments. This proteomic approach is an unsupervised approach to reveal the relatedness between the analyzed samples and the database of microorganisms using a binary matrix approach. The binary matrix is analyzed using diverse visualization and multivariate statistical techniques for bacterial classification and identification.
This study investigated the capability of the aforementioned MS-based proteomic approach to identify biological agents using double-blind (hereafter referred to as blind) samples that consisted of various microorganisms of interest to civilian and military installations. The present study included category A biological agents, mixtures of organisms, and negative controls without prior knowledge of the identity of the microorganisms. The in-house database consists of 881 microbial genomes as of 2 May 2009. The identification process for all samples revealed that several samples consisted of a mixture of bacterial species. The results of the blind studies showed a promising outlook for applying this MS-based proteomic approach to the classification of unknown bacterial mixtures at the species and strain level depending on the availability of complete genome sequences.
Ammonium bicarbonate, dithiothreitol, urea, acetonitrile (ACN; high-performance liquid chromatography [HPLC] grade), and formic acid were purchased from Burdick and Jackson (St. Louis, MO). Sequencing-grade modified trypsin was purchased from Promega (Madison, WI).
Twenty-one blind biological samples were prepared by streaking cells from cryopreserved stocks onto appropriate agar. Bacillus subtilis, Bacillus thuringiensis, Staphylococcus aureus, Enterococcus faecalis, and Pseudomonas aeruginosa were streaked onto tryptic soy agar (TSA; catalog number CM100; Culture Media and Supplies, Oswego, IL) plus 5% sheep blood. Burkholderia thailandensis and Clostridium phytofermentans ISDg were streaked onto nutrient agar (NA; catalog number CM145; Culture Media and Supplies, Oswego, IL). All plates were incubated for approximately 18 h at 37°C and stored at 4°C for no longer than 10 days. Cells from plate cultures were used to inoculate liquid cultures consisting of 10 ml of tryptic soy broth (TSB; catalog number CM104; Culture Media and Supplies, Oswego, IL) for B. subtilis, B. thuringiensis, S. aureus, E. faecalis, and P. aeruginosa or nutrient broth (NB; catalog number CM146; Culture Media and Supplies, Oswego, IL) for B. thailandensis. All liquid cultures were incubated for approximately 18 h at 37°C with rotary aeration at 180 rpm. After incubation, bacteria from liquid cultures were harvested by centrifugation (2,300 relative centrifugal force [RCF] at 4°C for 10 min), washed, and resuspended in an equal volume of phosphate-buffered saline (PBS). The Bacillus species were observed under a microscope to consist predominately of spores. Samples provided for analysis consisted of either a single bacterium or multiple bacteria mixed together. For mixed samples, all bacteria were added in a ratio of 1:1 by volume. All bacteria were present at a concentration between 10^7 and 10^9 CFU/ml as determined by serial dilution and plating onto appropriate agar. All samples were produced at the microbiology laboratory at the U.S. Army Edgewood Chemical Biological Center in a blind format and were assigned number codes for processing and analysis. The identities of all blind samples were revealed upon the completion of all analyses.
A negative control sample also was included that consisted of PBS only (no bacteria).
The lysis of all blind samples was performed using a modified sonication method (2, 20, 21). All blind samples, including any sporulated bacteria, were lysed by microprobe ultrasonication (Branson 450 digital sonifier; Branson, Danbury, CT). The blind samples were placed on ice and lysed with a 20-s pulse on and 5-s pulse off (cooling time) and 25% amplitude for a 5-min duration. To verify that the cells were disrupted, a small portion of the lysate was examined under confocal microscopy, and another portion was reserved for one-dimensional gel analysis.
The lysate was centrifuged at 14,100 × g for 30 min to remove all cellular debris. The supernatant then was added to a Microcon YM-3 filter unit (catalog number 42404; Millipore) and centrifuged at 14,100 × g for 30 min. The effluent was discarded. The filter membrane was washed with 100 mM ammonium bicarbonate (ABC) and centrifuged for 15 to 20 min at 14,100 × g. Cellular proteins were denatured by adding 8 M urea and 3 μg/μl dithiothreitol (DTT) to the filter and incubating it overnight at 37°C on an orbital shaker set to 60 rpm. Twenty microliters of 100% acetonitrile was added to the tubes and allowed to incubate at room temperature for 5 min. The tubes then were centrifuged at 14,100 × g for 30 to 40 min and washed three times using 150 μl of 100 mM ABC solution. On the last wash, the ABC solution was shaken for 20 min, followed by centrifugation at 14,100 × g for 30 to 40 min. The filter unit then was transferred to a new receptor tube, and proteins were digested with 5 μl of trypsin in 240 μl of ABC solution plus 5 μl ACN. Protein digestion occurred overnight at 37°C on an orbital shaker set to 55 rpm. Sixty microliters of 5% ACN-0.5% formic acid (FA) was added to each filter to quench the trypsin digestion, followed by 2 min of vortexing for sample mixing. The tubes were centrifuged for 20 to 30 min at 14,100 × g. An additional 60 μl 5% ACN-0.5% FA mixture was added to the filter and centrifuged. Alternative protocols were used in which the denaturation step was eliminated, and the digestion time was reduced using various amounts of trypsin and different digestion temperatures. The effluent then was analyzed using LC-MS-MS.
The tryptic peptides were separated using a capillary Hypersil C18 column (300 Å; 5 μm; 0.1 mm [inner diameter] by 100 mm) by using the Surveyor LC from ThermoFisher (San Jose, CA). The elution was performed using a linear gradient from 98% A (0.1% FA in water) and 2% B (0.1% FA in ACN) to 60% B for 60 min at a flow rate of 200 μl/min, followed by 20 min of isocratic elution. The resolved peptides were electrosprayed into a linear ion trap mass spectrometer (LTQ; Thermo Scientific, San Jose, CA) at a flow rate of 0.8 μl/min. Product ion mass spectra were obtained in the data-dependent acquisition mode that consisted of a survey scan across the m/z range of 400 to 2,000, followed by seven scans on the most intense precursor ions activated for 30 ms by an excitation energy level of 35%. A dynamic exclusion was activated for 3 min after the first MS-MS spectrum acquisition for a given ion. Uninterpreted product ion mass spectra were searched against a microbial database with TurboSEQUEST (Bioworks 3.1; Thermo Scientific, San Jose, CA) followed by the application of an in-house proteomic algorithm for bacterial identification.
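The data-dependent acquisition scheme described above (a survey scan, then MS-MS on the seven most intense precursors, with a 3-min dynamic exclusion after a precursor is first fragmented) can be sketched in Python as follows. This is only an illustration of the selection logic, not the instrument's control software: the function name, the (m/z, intensity) peak representation, and the 0.5 m/z matching tolerance are assumptions introduced here.

```python
def select_precursors(survey_peaks, time_min, excluded,
                      top_n=7, exclusion_min=3.0, tol=0.5):
    """Pick up to top_n precursors from one survey scan.

    survey_peaks: list of (mz, intensity) tuples from the survey scan.
    excluded: dict mapping an excluded m/z to the time (min) its exclusion ends;
              it is updated in place. tol is an assumed m/z matching tolerance.
    """
    # Release precursors whose exclusion window has expired.
    for mz in [m for m, until in excluded.items() if until <= time_min]:
        del excluded[mz]

    chosen = []
    # Walk through the peaks from most to least intense.
    for mz, _intensity in sorted(survey_peaks, key=lambda p: p[1], reverse=True):
        if len(chosen) == top_n:
            break
        if any(abs(mz - ex) <= tol for ex in excluded):
            continue  # still dynamically excluded
        chosen.append(mz)
        excluded[mz] = time_min + exclusion_min
    return chosen


# Hypothetical survey scan: nine peaks with decreasing intensity.
peaks = [(400.0 + 10 * i, 100.0 - i) for i in range(9)]
exclusion_list = {}
first = select_precursors(peaks, 0.0, exclusion_list)
second = select_precursors(peaks, 1.0, exclusion_list)
```

With the same peaks present in consecutive scans, the second call skips the seven precursors already fragmented and selects the weaker ones, which is the purpose of dynamic exclusion.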
A protein database was constructed in a FASTA format using the annotated bacterial proteome sequences derived from the sequenced chromosomes of 881 bacteria, including their sequenced plasmids (as of May 2009). A PERL program (ActiveState) was written to automatically download these sequences from the National Institutes of Health National Center for Biotechnology Information (NCBI) site (http://www.ncbi.nlm.nih.gov). Each database protein sequence was supplemented with information about the source organism and the genomic position of the respective ORF embedded into a header line. The database of bacterial proteomes was constructed by translating putative protein-coding genes and consists of tens of millions of amino acid sequences of potential tryptic peptides obtained by the in silico digestion of all proteins (assuming up to two missed cleavages).
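The in silico digestion with up to two missed cleavages can be illustrated with a short Python sketch. It assumes the commonly used trypsin rule (cleavage C-terminal to K or R, except when the next residue is P); the text does not state which cleavage rule was applied, and the function names and the example sequence are hypothetical.

```python
def cleave(protein):
    """Split a protein sequence at tryptic sites (after K/R, not before P)."""
    fragments, start = [], 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            fragments.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        fragments.append(protein[start:])  # C-terminal remainder
    return fragments


def tryptic_peptides(protein, max_missed=2):
    """Return all peptides containing at most max_missed missed cleavages."""
    fragments = cleave(protein)
    peptides = set()
    for i in range(len(fragments)):
        # Joining k+1 adjacent fragments corresponds to k missed cleavages.
        for j in range(i, min(i + max_missed + 1, len(fragments))):
            peptides.add("".join(fragments[i:j + 1]))
    return peptides


# Hypothetical sequence used only to exercise the functions.
example = tryptic_peptides("MKAARPGKLLR")
```

Allowing missed cleavages enlarges the peptide list considerably, which is consistent with the database reaching tens of millions of candidate peptides.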
The experimental product ion mass spectral data of bacterial peptides were searched using the SEQUEST (11) algorithm against a constructed proteome database of microorganisms. The SEQUEST thresholds for searching the product ion mass spectra of peptides were Xcorr, deltaCn (DelCn), Sp, RSp, and deltaMpep (DelM). These parameters provided a uniform matching score for all candidate peptides. The generated outfiles of these candidate peptides then were validated using the PeptideProphet algorithm (14). Peptide sequences with a probability score of 95% and higher were retained in the data set and used to generate a binary matrix of STB assignments. The binary matrix assignment was populated by matching the peptides with corresponding proteins in the database and assigning them a score of one. A score of zero was assigned for a nonmatch. Each column in the binary matrix represents the proteome of a given bacterium, and each row represents a tryptic peptide sequence from the LC-MS-MS analysis. Microorganisms in a blind sample were matched with the bacterium/bacteria based on the number of unique peptides that remained after the filtering of degenerate peptides from the binary matrix. The verification of the classification and identification of candidate microorganisms was performed through hierarchical clustering analysis and taxonomic classification (Fig. 1).
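The construction of the STB binary matrix and the filtering of degenerate peptides can be sketched in Python as below. This is a simplified illustration under stated assumptions, not the in-house implementation: each proteome is represented as a set of in silico peptides, the organism names and peptide strings are hypothetical, and a peptide is treated as unique when it matches exactly one organism.

```python
def build_stb_matrix(scored_peptides, proteomes, min_probability=0.95):
    """Build a sequence-to-bacterium matrix.

    scored_peptides: list of (peptide, probability) pairs after validation.
    proteomes: dict mapping organism name -> set of in silico peptides.
    Returns the column order and a dict peptide -> row of 0/1 values.
    """
    organisms = sorted(proteomes)
    matrix = {}
    for peptide, probability in scored_peptides:
        if probability < min_probability:
            continue  # keep only peptides at or above the 95% threshold
        matrix[peptide] = [1 if peptide in proteomes[org] else 0
                           for org in organisms]
    return organisms, matrix


def unique_peptide_counts(organisms, matrix):
    """Count peptides that match exactly one organism (degenerate rows dropped)."""
    counts = {org: 0 for org in organisms}
    for row in matrix.values():
        if sum(row) == 1:
            counts[organisms[row.index(1)]] += 1
    return counts


# Hypothetical toy data.
proteomes = {
    "organism_A": {"LSDEK", "VTAGR"},
    "organism_B": {"ELNQK", "VTAGR"},
    "organism_C": {"MMGGK"},
}
scored = [("LSDEK", 0.99), ("VTAGR", 0.98), ("ELNQK", 0.96), ("MMGGK", 0.60)]
organisms, matrix = build_stb_matrix(scored, proteomes)
counts = unique_peptide_counts(organisms, matrix)
```

In this toy case the shared peptide is discarded as degenerate and the low-probability peptide never enters the matrix, so only the organism-specific peptides contribute to the counts.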
The SEQUEST-processed product ion mass spectra of the peptide ions were compared to an NCBI protein database using the in-house-developed software (BACid). BACid (10) provided a taxonomically meaningful and easy-to-interpret output. BACid calculates the probabilities that a peptide sequence assignment to a product ion mass spectrum is correct and uses accepted spectrum-to-sequence matches to generate an STB binary matrix of assignments. Validated peptide sequences, either present or absent in various strains (STB matrices), were visualized as assignment bitmaps and analyzed by a BACid module that used phylogenetic relationships among bacterial species as part of a decision tree process. The bacterial classification and identification algorithm used assignments of organisms to taxonomic groups (phylogenetic classification) based on an organized scheme that begins at the phylum level and follows through the class, order, family, genus, and strain levels. BACid was developed in-house using PERL, MATLAB, and Microsoft Visual Basic.
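One way to picture the top-down taxonomic assignment is to walk the ranks from phylum downward and stop at the deepest rank on which all candidate organisms agree. The Python sketch below illustrates that idea only; it is not the BACid decision tree. The rank list adds a species rank between genus and strain because results are reported at the species level, and the function name and example lineages are supplied here for illustration.

```python
RANKS = ["phylum", "class", "order", "family", "genus", "species", "strain"]


def deepest_common_rank(lineages):
    """Return (rank, name) for the deepest rank shared by all lineages.

    lineages: list of tuples ordered as in RANKS. Returns None when the
    candidates already disagree at the phylum level.
    """
    result = None
    for i, rank in enumerate(RANKS):
        names = {lineage[i] for lineage in lineages}
        if len(names) != 1:
            break  # candidates diverge at this rank
        result = (rank, names.pop())
    return result


# Illustrative lineages for three candidate organisms.
aureus_1 = ("Firmicutes", "Bacilli", "Bacillales", "Staphylococcaceae",
            "Staphylococcus", "Staphylococcus aureus", "N315")
aureus_2 = ("Firmicutes", "Bacilli", "Bacillales", "Staphylococcaceae",
            "Staphylococcus", "Staphylococcus aureus", "Mu50")
subtilis = ("Firmicutes", "Bacilli", "Bacillales", "Bacillaceae",
            "Bacillus", "Bacillus subtilis", "168")

species_call = deepest_common_rank([aureus_1, aureus_2])
```

When the candidates are several strains of one species, the call stops at the species level, which mirrors how an unsequenced strain is reported.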
The capabilities, and possible limitations, of the proteomic approach with regard to the identification of biological agents were evaluated using blind biological samples. Twenty-one blind microbial samples were provided and analyzed by the LC-MS-MS proteomic approach. The composition of the blind samples varied, with some samples having only one bacterium and others having as many as five different bacterial species or strains.
An example of the resultant data from the BACid program for one blind sample is shown in Fig. 2. Blind sample 20 was identified as B. subtilis 168 using the BACid data-processing algorithm. This identification algorithm eliminated all of the unwanted and degenerate peptides and retained only the unique peptides that represent a 99% probability for correct identification. In this case, 212 unique peptides were identified and associated with proteins from the B. subtilis 168 strain. The 212 B. subtilis 168 unique peptides represented 89% of the total number of unique peptides in the blind sample. Table 1 shows a select set of unique peptides and their corresponding proteins that are associated with B. subtilis 168. These bacterial proteins have different cellular functions, such as transcription, translation, and cellular signaling. They represent a set of unique biomarkers that could be utilized to establish strain-level discrimination between B. subtilis 168 and other members of the Bacillus genus.
To ensure confidence in the assignment of the candidate bacterium, a similarity analysis was performed on the nearest-neighbor species and strains. In this similarity analysis, all sequenced strains of B. subtilis and Bacillus species that are genetically related to the candidate bacterium were included in the Euclidean distance dendrogram. Figure 3 shows a dendrogram of the similarity analysis of the blind sample identified as B. subtilis 168. In Fig. 3, the sample was identified as being most similar to B. subtilis 168 using the unique peptides that were associated with this bacterial candidate. The next closest bacterium to the candidate was determined to be Bacillus licheniformis ATCC 14580. According to these similarities, a comparison of B. licheniformis and B. subtilis 168 analyses showed a difference of almost 50% in the unique proteins identified by the BACid algorithm. Based on these significant differences and a lower degree of confidence assigned, B. licheniformis was not included as a candidate bacterium. Therefore, the identity of sample 20 was assigned to B. subtilis 168 using the BACid algorithm. This assignment was correct as later revealed at the completion of the tests.
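The distance measure underlying such a similarity analysis can be shown with a small Python sketch that ranks database organisms by Euclidean distance to the sample's binary peptide profile. It covers only the distance and nearest-neighbor ranking; building the full dendrogram would additionally require a hierarchical clustering step, which is not shown. The profiles and names are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length 0/1 profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def rank_neighbors(sample, profiles):
    """Return (name, distance) pairs sorted from nearest to farthest."""
    distances = {name: euclidean(sample, vec) for name, vec in profiles.items()}
    return sorted(distances.items(), key=lambda item: item[1])


# Each position is one observed peptide; 1 means the organism's proteome
# contains it. The sample itself contains every observed peptide.
sample_profile = [1, 1, 1, 1, 1, 1]
database_profiles = {
    "candidate_strain": [1, 1, 1, 1, 1, 1],
    "related_species":  [1, 1, 1, 0, 0, 0],
    "distant_species":  [0, 0, 0, 0, 0, 1],
}
ranking = rank_neighbors(sample_profile, database_profiles)
```

For binary profiles the squared distance equals the number of peptides on which two profiles disagree, so a large gap between the first and second neighbors supports assigning the sample to the nearest entry alone.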
The BACid analysis of sample 18 is shown in Fig. 4. BACid eliminated all of the unwanted and degenerate peptides, and only the unique peptides that represented a 99% confidence level and above were retained for each organism. In this case, the number of unique peptides varied for the different bacterial candidates. E. faecalis had the highest number of unique peptides followed by B. thuringiensis, and B. thailandensis had the fewest unique peptides. Interestingly, it was revealed after the tests that the blind samples had approximately equivalent bacterial concentrations for each organism, yet the number of unique peptides differed. This variation in the number of unique peptides in the output of the BACid could be due to the dynamic nature of the bacterial species during sample processing. Some bacteria could have a larger number of lysed proteins that were suspended in the extraction buffer than that of other species in the sample. This difference in bacterial protein concentrations is shown in the histogram in Fig. 4 generated from the BACid output, where the relative number of peptides for each species is compared to that of the other species. This feature in the BACid algorithm could be used as a pseudoquantitative technique in the determination of lysed bacterial proteins in a biological sample and thus aid in evaluating sample-processing modules. Also shown in Fig. 4 are six bacterial candidates near the cutoff threshold within the Staphylococcus genus. This pattern is due to the fact that the Staphylococcus aureus ATCC 3359 strain present in the blind sample has not been sequenced or reported in the public domains, and thus it was not part of the constructed proteome database. However, the BACid was capable of providing a nearest-neighbor match to the species level (S. aureus) and thus identified the bacterium correctly as S. aureus subsp. aureus.
It is noteworthy that this bacterial strain, which is not genomically sequenced, could be identified only to the species level. The rapid increase in the number of sequenced bacteria will benefit this proteomic approach and enhance its robustness in the identification process of biological samples. However, a significant advantage of the approach is that if a particular strain has not been sequenced but the species is represented in the database, it is highly likely that the unsequenced sample strain will be identified to the species level. The appearance of the histogram from a BACid analysis indicates the degree of the accuracy of the identification process. Strain-level experimental identification is indicated by a single line (Fig. 4) in the histogram (Enterococcus faecalis V583) or by a grouping of lines, where one line clearly dominates (e.g., Burkholderia thailandensis E264 and Pseudomonas aeruginosa PAO1) with respect to the number of unique peptides. B. thuringiensis has two strains resident in the database, and both provide a similar set of peptides. This occurs because the two strains do not display peptides that clearly distinguish one from the other. The fifth bacterium in the sample 18 mixture was S. aureus strain ATCC 3359, and this organism's genome has not been sequenced. However, the species-level identification (S. aureus) of this strain is indicated by a grouping of lines (Fig. 4) that does not display a significant difference in the number of unique peptides. This blind sample was correctly identified as a mixture of five bacteria: B. thuringiensis, S. aureus subsp. aureus, E. faecalis V583, B. thailandensis E264, and P. aeruginosa PAO1, where S. aureus and B. thuringiensis were identified to the species level, and the other three were identified to the strain level.
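The way the histogram is read (a single line or one clearly dominant line indicates a strain-level call, whereas a group of similar lines indicates a species-level call) can be expressed as a simple rule. The Python sketch below is an assumption-laden illustration: the text gives no numeric criterion for "clearly dominates", so the twofold dominance factor, the function name, and all peptide counts are hypothetical.

```python
def call_level(strain_counts, dominance=2.0):
    """Decide between a strain-level and a species-level call for one species.

    strain_counts: dict mapping strain name -> number of unique peptides.
    dominance: assumed factor by which the top strain must exceed the runner-up.
    """
    ranked = sorted(strain_counts.items(), key=lambda kv: kv[1], reverse=True)
    top_name, top_count = ranked[0]
    if top_count == 0:
        return ("none", None)  # no supporting peptides at all
    if len(ranked) == 1 or top_count >= dominance * ranked[1][1]:
        return ("strain", top_name)  # single line or one dominant line
    return ("species", None)  # similar lines: strains are not resolved


# Hypothetical unique-peptide counts.
single_line = call_level({"strain_1": 40})
dominant_line = call_level({"strain_1": 30, "strain_2": 5, "strain_3": 4})
similar_lines = call_level({"strain_1": 12, "strain_2": 11, "strain_3": 10})
```

Any real threshold would need to be calibrated against samples of known identity; the factor used here only makes the qualitative rule concrete.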
The in-house database originated from 881 genomically sequenced bacterial strains. The blind sample suspensions consisted of bacteria in single and mixture forms, and their genomes were sequenced or not sequenced. The bacterial strains found in experimental samples that do not have a sequenced genome, therefore, cannot be found in available public databases and the in-house database. Figure 5a shows the classification map of the 21 experimentally processed blind samples, and Fig. 5b shows that of the bacterial strain sample identities (sample key). In Fig. 5a, the bacteria on the abscissa reflect every bacterium found at least once in the 21 experimentally determined samples. The bacteria listed in Fig. 5a were not disclosed in advance; rather, all 21 experiments produced the bacterial identities from the BACid algorithm (10). Figure 5b represents the sample key or actual bacterial species and strains in the blind samples. This information was not released to the authors until the Fig. 5a results were turned in for experimental performance verification. A comparison between Fig. 5a and b shows that bacterial discrimination was achieved by relying on the unique peptides corresponding to the bacteria in the blind samples. An identification was based on the matching probability of the unique peptides from a blind sample with a bacterial entry in the bacterial proteome database at more than a P = 0.95 confidence level. The strain-level identification, indicated by the filled red boxes in Fig. 5a, was assigned due to a close match with the analyzed microorganisms' unique peptides and their nearest-neighbor strains.
Figure 4 shows the analysis of sample 18 and provides an example of identification to the strain level as well as classification to the species level (as described above) for Staphylococcus aureus strain ATCC 3359, which is not currently sequenced. A correct species-level identification was obtained for all bacteria in the blind samples that are unsequenced; these are indicated by a vertical hashed box in Fig. 5a. Thus, the classification probability was statistically high enough based on a comparison of the virtual proteome of a database strain and the experimental unique proteins of the unsequenced-genome bacterial sample. Therefore, identification was reported at the species level. Blind sample 20 (Fig. 2) was identified as B. subtilis; however, the sample key reported it as B. atrophaeus. This difference is due to the lack of a proteome for B. atrophaeus, which taxonomically is considered B. subtilis. Our data support the proposition that B. atrophaeus should be reclassified as a strain of B. subtilis (3) (the gray square for sample 20 in Fig. 5a and b).
Blind sample 17 was investigated for BACid characterization. The experimental set of peptides could provide results only to the Clostridium genus level, because all nine clostridial bacteria (species and strains) resident in the database produced a histogram (data not shown) similar to that of Staphylococcus aureus, which is shown in Fig. 4. The experimental peptides matched that portion of the virtual proteome common to all Clostridia. Therefore, the complete experimentally derived, tryptic peptide information record was stored as a separate bacterial line item as Clostridium species 1 in the database of 881 bacteria. Another aliquot of the blind sample was processed with data reduction and searching in the new hybrid database. The highest match was with the Clostridium species 1 entry. After the results were submitted, the identity of sample 17 was revealed to be Clostridium phytofermentans ISDg. This strain does not have its genome sequenced, yet BACid was able to match the virtual proteins that are similar to those of the Clostridium genus to the experimentally observed peptides. Thus, BACid was able to characterize sample 17 as Clostridium without choosing one of the nine clostridia strains resident in the database or other bacterial genera. BACid instead matched Clostridium species 1 to the experimental peptides, which indicated that there is sufficient information in the experimental peptides to differentiate Clostridium phytofermentans ISDg from the nine database clostridia strains. It is tempting to consider that this approach, when combined with the accurate mass tag approach of Lipton et al. (15), has the potential to diminish the impact of genome-sequencing deficiencies for some bacterial strains. The rapid advancement in genome-sequencing projects will enhance the robustness of this approach through the expansion of the proteome database.
This expansion in the proteome database is anticipated to include the cellular proteins that can be utilized for strain-level differentiation.
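The database-augmentation step used for sample 17 (storing the experimental peptide record as a new entry and matching a replicate against the resulting hybrid database) can be mimicked with a toy Python sketch. It assumes that matching reduces to counting shared peptides, which is a simplification of the actual scoring; the function name, entry names other than "Clostridium species 1", and peptide strings are hypothetical.

```python
def best_match(observed, database):
    """Return the entry sharing the most peptides with the observed set."""
    scores = {name: len(observed & peptides)
              for name, peptides in database.items()}
    return max(scores, key=scores.get), scores


# Virtual proteomes of two sequenced relatives; they share genus-level peptides.
database = {
    "sequenced_clostridium_1": {"AAGLK", "TTVER", "QQLNK"},
    "sequenced_clostridium_2": {"AAGLK", "TTVER", "PPSDR"},
}

# First analysis: only the genus-common peptides match, so the entries tie.
first_run = {"AAGLK", "TTVER", "WWYFK", "HHGMR"}
_, scores_before = best_match(first_run, database)

# Store the experimental record as its own entry, then analyze a replicate.
database["Clostridium species 1"] = set(first_run)
replicate = {"AAGLK", "TTVER", "WWYFK", "CCNDK"}
winner, scores_after = best_match(replicate, database)
```

The replicate shares strain-specific peptides only with the stored experimental record, so that entry outscores the sequenced relatives even though no genome is available for the sample.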
The results showed that the method was effective in identifying bacteria whether the sample was composed of one organism or a mixture, or even if the sample is not resident in the database. No false positives were observed for any of the blind samples that were analyzed, including the blank sample. The proteomic MS approach reported herein is not meant as a replacement for DNA-based identification methods. We envision this approach as a second, confirmatory approach to pathogen identification. Additionally, there are some major advantages to the proteomic method over other molecular biology methods such as the DNA-based methods, in that (i) no prior information about the sample is required for analysis; (ii) no specific reagents are needed in the analysis process; (iii) proteomic MS is capable of identifying an organism when a primer/probe set is not available; (iv) proteomic MS requires less rigorous sample preparation than PCR; and (v) proteomic MS can provide a presumptive identification of a true unknown organism by mapping its phylogenetic relationship with other, known pathogens. The proteomic method also could be applied to identify viruses and toxins, because viruses and toxins are included in the proteome database.
Naturally occurring environmental samples usually contain many organisms at very low concentrations in addition to the target species, and the background organisms may outnumber the target organism. Such samples would challenge the method reported herein. This issue is being addressed by spiking a target organism into several environmental matrices at different amounts.
Improvements in sample preparation and mass spectrometry technologies should increase the number of peptides identified relative to current methods. This could make MS proteomics a valuable tool, used in conjunction with genomic approaches, for the identification and classification of microorganisms. Overall, these studies showed that the proposed MS-based proteomic approach is a useful method that may be applied to diverse biothreat scenarios and has the potential for bacterial differentiation and identification at the species and strain levels for individual bacteria or their mixtures.
Published ahead of print on 2 April 2010.