We have developed web-based software for the rapid identification of protein biomarkers of bacterial microorganisms. Proteins from bacterial cell lysates were ionized by matrix-assisted laser desorption ionization (MALDI), mass isolated, and fragmented using a tandem time of flight (TOF-TOF) mass spectrometer. The sequence-specific fragment ions generated were compared to a database of in silico fragment ions derived from bacterial protein sequences whose molecular weights are the same as the nominal molecular weights of the protein biomarkers. A simple peak-matching and scoring algorithm was developed to compare tandem mass spectrometry (MS-MS) fragment ions to in silico fragment ions. In addition, a probability-based significance-testing algorithm (P value), developed previously by other researchers, was incorporated into the software for the purpose of comparison. The speed and accuracy of the software were tested by identification of 10 protein biomarkers from three Campylobacter strains that had been identified previously by bottom-up proteomics techniques. Protein biomarkers were identified using (i) their peak-matching scores and/or P values from a comparison of MS-MS fragment ions with all possible in silico N and C terminus fragment ions (i.e., ions a, b, b-18, y, y-17, and y-18), (ii) their peak-matching scores and/or P values from a comparison of MS-MS fragment ions to residue-specific in silico fragment ions (i.e., in silico fragment ions resulting from polypeptide backbone fragmentation adjacent to specific residues [aspartic acid, glutamic acid, proline, etc.]), and (iii) fragment ion error analysis, which distinguished the systematic fragment ion error of a correct identification (caused by calibration drift of the second TOF mass analyzer) from the random fragment ion error of an incorrect identification.
Food-borne illness is a serious and continuing problem, with an estimated 76 million cases in the United States per year (http://www.cdc.gov). It is frequently caused by bacteria and viruses that are often ubiquitous in the environment and are difficult to eliminate due to their ability to adapt. In addition to the resulting morbidity, food-borne illness has enormous societal costs, including losses in worker productivity due to illness and recalls of food products determined (or suspected) to be contaminated. Consequently, there is a critical need to develop rapid and sensitive methods for detection and accurate identification of food-borne pathogens.
A number of techniques have been developed for detection and identification of food-borne pathogens. A relatively recent technique for bacterial identification involves the use of mass spectrometry (MS). Because of its sensitivity and high specificity, MS has become a popular technique for chemicotaxonomic classification of microorganisms (16, 27). The use of MS in the analysis of microorganisms is a relatively recent application that was dramatically accelerated by the development of two ionization techniques in the late 1980s and early 1990s: electrospray ionization (15) and matrix-assisted laser desorption ionization (MALDI) (24, 37). When coupled with time of flight (TOF) MS, MALDI has been demonstrated to be a powerful tool for “fingerprinting” microorganisms by ionization and detection of proteins from intact bacterial cells or extracts resulting from bacterial cell lysis (1, 2, 3, 8-12, 19, 21, 25, 26, 29, 34, 40, 41, 42). Typically, MALDI-TOF MS “fingerprinting” of microorganisms involves analysis using either pattern recognition or bioinformatic algorithms.
Pattern recognition analysis compares MALDI-TOF MS spectra of samples of unknown microorganisms to spectra of known microorganisms. A high degree of similarity between the MS spectrum of an unknown microorganism and an MS spectrum of a known microorganism strongly suggests the identity of the unknown microorganism (22, 39, 43). It should be noted that pattern recognition analysis does not rely on actual identification of the biomarker ion peaks in an MS spectrum. It is the pattern generated by multiple ion peaks that constitutes a microorganism's “fingerprint.” The actual identities of individual ion peaks are not specified, and the peaks could be peaks for any of a number of possible biological molecules generated by a microorganism, including proteins, nucleic acids, lipids, etc.
Microorganism identification by bioinformatic analysis of MALDI-TOF MS data involves using the protein molecular weights (MWs) in bacterial genomic databases to assign biomarker ion peaks in a mass spectrum to specific proteins (4, 5, 32, 33, 45). If a significant number of biomarker ion peaks in a mass spectrum correspond to protein MWs for the open reading frames of a microorganism's genome, then the microorganism is considered identified. Such an analysis has also incorporated the simplest and most common posttranslational modification (PTM) observed for bacterial proteins, N-terminal methionine cleavage (5). It should be noted, however, that “identification” of a microorganism relies solely on a sufficient number of protein MWs derived from open reading frames of its genome corresponding to the m/z of biomarker ions in a MALDI-TOF MS spectrum. However, the protein MW alone is not sufficient to definitively identify a biomarker ion as a specific protein. Protein biomarkers are considered to be tentatively assigned instead of definitively identified.
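The MW-based assignment described above can be illustrated with a minimal Python sketch. It is not taken from any of the cited studies: the function name, the 5-Da tolerance, the dictionary layout, and the approximate average mass used for a methionine residue are all assumptions made for illustration, and the sketch ignores the sequence rules that govern whether the N-terminal methionine is actually removed.

```python
PROTON = 1.00728      # mass of a proton (Da)
MET_RESIDUE = 131.19  # approximate average mass of a methionine residue (Da); assumed value

def assign_peaks(peak_mzs, protein_mws, tolerance_da=5.0):
    """Tentatively assign singly protonated biomarker peaks to proteins by MW alone.

    peak_mzs: observed m/z values of biomarker ions.
    protein_mws: dict mapping a protein name to the average MW predicted
        from its open reading frame (initiator methionine included).
    tolerance_da: hypothetical matching tolerance, not a value from the text.
    """
    assignments = {}
    for mz in peak_mzs:
        hits = []
        for name, mw in protein_mws.items():
            # Consider the unmodified protein and the form lacking N-terminal Met.
            forms = (("intact", mw), ("N-terminal Met removed", mw - MET_RESIDUE))
            for label, mass in forms:
                if abs(mass + PROTON - mz) <= tolerance_da:
                    hits.append((name, label))
        assignments[mz] = hits
    return assignments

# Hypothetical MWs and peaks, for illustration only.
example = assign_peaks([10001.0, 12001.0], {"orf1": 10000.0, "orf2": 12131.19})
```

As the text notes, an assignment made this way is tentative: several proteins can fall inside the same mass window, so a hit list with more than one entry per peak is expected.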
Analysis of samples containing multiple bacterial organisms presents increased challenges for MALDI-TOF MS when protein MW is the sole criterion for protein biomarker identification. Clearly, it would be advantageous if researchers could obtain more information about a biomarker in addition to its MW. In the case of protein biomarkers, this can be accomplished by enzymatically digesting a protein in solution and analyzing its tryptic peptides by MS (peptide mass mapping) or by tandem MS (MS-MS) (sequence tags) (45). Alternatively, it is possible to fragment mature, intact proteins (without digestion) in the gas phase to obtain sequence-specific and PTM information. This approach is referred to as top-down proteomics. Until recently, top-down proteomics was possible only if Fourier transform ion cyclotron resonance MS involving complicated gas phase ion dissociation techniques was used (6, 23).
Although not originally designed for top-down proteomics, recently developed MALDI-tandem TOF (MALDI-TOF-TOF) MS was shown to fragment small or modest-size proteins (5 kDa < molecular mass < 15 kDa) without prior digestion (28). Demirev and coworkers (7) identified Bacillus atrophaeus and Bacillus cereus spores by fragmenting their protein biomarkers using a MALDI tandem mass spectrometer and analyzing the sequence-specific fragment ions generated by comparison to in silico fragment ions derived from protein amino acid sequences from genomic databases. Protein and microorganism identities were determined using a probability-based significance-testing algorithm (P value). The P value algorithm calculates the probability that a protein or microorganism identification occurred randomly. The smaller the P value, the lower the probability that an identification occurred randomly. The data analysis was performed using software developed in house (7).
In the current study, web-based software and databases, developed in house at the U.S. Department of Agriculture (USDA), were used to identify 10 protein biomarkers from three pure strains of Campylobacter by sequence-specific fragmentation using a MALDI-TOF-TOF mass spectrometer. Many of the protein biomarkers had been identified previously by bottom-up proteomics techniques (9, 11, 12), which provided an excellent data set to test the accuracy and performance of the algorithms incorporated into the software. MALDI-TOF-TOF MS-MS fragment ions were compared with a database of in silico fragment ions derived from bacterial protein sequences. The sequence-specific MS-MS fragment ions were used to identify a protein and thus the source microorganism. A simple peak-matching mathematical algorithm, incorporated into the software, was used to score and rank protein and microorganism identifications. In addition, the P value algorithm of Demirev and coworkers (7) was incorporated into the USDA software (available with execution of appropriate control usage agreement) for comparison to the peak-matching algorithm. The peak-matching algorithm correctly identified a protein biomarker among as many as ~1,400 possible bacterial proteins and gave rankings for protein identification comparable to the rankings obtained by the more complicated and computationally intensive P value calculation. We often observed enhancement of the score for the correct identification when MS-MS fragment ions were compared to residue-specific in silico fragment ions rather than to non-residue-specific in silico fragment ions. In addition, the correctness of the algorithm's identification was, in certain cases, further confirmed by fragment ion error analysis, which compared the random error caused by false matches between MS-MS fragment ions and in silico fragment ions with the systematic error observed for correct matches due to drift in the calibration of the TOF mass analyzer (38).
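The distinction between systematic and random fragment ion error can be sketched in Python. This is only one plausible way to quantify it and is an assumption on our part: the study's own error analysis is described in the supplemental material and may differ. The sketch fits a straight line to the error (observed m/z minus in silico m/z) as a function of m/z and reports how much of the error variance the trend explains; the function name and the choice of a linear model are hypothetical.

```python
def error_trend(matches):
    """Least-squares line through fragment ion error versus in silico m/z.

    matches: list of (observed m/z, in silico m/z) pairs for matched ions;
        needs at least two pairs with distinct in silico m/z values.
    Returns (slope, intercept, r_squared). A smooth calibration drift gives a
    high r_squared; scattered errors from false matches give a low one.
    """
    xs = [theo for _, theo in matches]
    errs = [obs - theo for obs, theo in matches]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_e = sum(errs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxe = sum((x - mean_x) * (e - mean_e) for x, e in zip(xs, errs))
    see = sum((e - mean_e) ** 2 for e in errs)
    slope = sxe / sxx
    intercept = mean_e - slope * mean_x
    # If every error is identical, the (flat) line explains them completely.
    r_squared = (sxe * sxe) / (sxx * see) if see > 0 else 1.0
    return slope, intercept, r_squared

# Synthetic data: a drift proportional to m/z versus alternating +/-1 Th errors.
mz_values = [1000.0, 2000.0, 3000.0, 4000.0, 5000.0, 6000.0]
drifting = [(mz * 1.0005 + 0.1, mz) for mz in mz_values]
scattered = [(mz + (1.0 if i % 2 == 0 else -1.0), mz) for i, mz in enumerate(mz_values)]
```

With the synthetic inputs above, the drifting set yields an r_squared close to 1 and the scattered set a value close to 0, which mirrors the qualitative contrast the text draws between correct and incorrect identifications.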
Figure 1 shows a flow chart of the process used for identification of protein biomarkers and bacteria.
Bacterial proteins were extracted from Campylobacter bacterial cells using a technique that has been previously reported (9-12, 29). Briefly, Campylobacter upsaliensis strain RM3195, C. coli strain RM2228, and C. lari strain RM2100 were each cultured on nonselective growth media for 24 to 48 h. Bacterial cells were harvested with a 1-μl loop (an amount which corresponded to 10^9 cells) and transferred to a microcentrifuge tube containing 0.5 ml of extraction solvents (67% water, 33% acetonitrile, and 0.1% trifluoroacetic acid) and 40 mg of 0.1-mm zirconia-silica beads (BioSpec Products Inc., Bartlesville, OK). The tube was capped and agitated for 60 s with a bead beater, resulting in cell lysis. The tube was then centrifuged at 8,161 × g for 4 to 5 min in order to pellet insoluble cellular debris.
Samples were analyzed with a 4800 TOF-TOF proteomics analyzer (Applied Biosystems, Foster City, CA). A 0.5-μl aliquot of sample supernatant was mixed with an equal volume of a saturated solution of MALDI matrix and deposited onto a 384-spot stainless steel target. Two MALDI matrices were utilized: 3,5-dimethoxy-4-hydroxycinnamic acid (sinapinic acid), a “cold” matrix; and α-cyano-4-hydroxycinnamic acid, a “hot” matrix. Laser desorption ionization was accomplished using a pulsed solid-state YAG laser (repetition rate, 200 Hz; wavelength, 355 nm; pulse width, ~5 ns). Spectra were acquired in positive-ion mode for both MS (linear mode) and MS-MS (reflectron mode). In MS linear mode, after laser desorption ionization, ions were accelerated from the first source by delayed ion extraction at 20 kV, separated over an effective ion path length of 1.5 m, and detected with a multichannel plate detector. In MS-MS reflectron mode, ions were accelerated from the first source by delayed ion extraction at 8.0 kV. Ions were separated spatially and temporally in the first field free region. Ions of interest were mass selected with a timed ion selector (TIS) or mass “gate” based on their arrival time at the TIS gate. The TIS was used to mass isolate specific protein ions for fragmentation based on their mass-to-charge ratio (m/z). The TIS was operated with a “window” of either ±50 Da or ±100 Da. Mass-selected ions were decelerated to 1.70 kV prior to entry into a floating collision cell at 2 kV. Ions were fragmented either by high-energy collision-induced dissociation or by postsource dissociation. The target gas for high-energy collision-induced dissociation was filtered air. Fragment ions exiting the collision cell were reaccelerated to 15 kV. A Bradbury-Neilsen ion gate after the second source could be used to suppress (and thus exclude) the precursor ion signal. MS-MS data were collected with the precursor ion suppressor gate in both the “on” and “off” modes. 
A two-stage reflectron mirror assembly was operated at 10.910 kV (mirror 1) and at 18.750 kV (mirror 2). Both linear and reflectron multichannel plate detectors were operated at 2.190 kV. The effective ion path length from the second source to the reflectron detector was 2.3 m. The instrument was externally calibrated in linear mode with the following calibrants: bovine insulin (MW, 5,733.58), Escherichia coli thioredoxin (MW, 11,673.47), and horse heart apomyoglobin (MW, 16,951.55). The instrument was externally calibrated in reflectron MS-MS mode using the y fragment ions of glu1-fibrino-peptide B (MW, 1570.60) at m/z 175.120 and 1441.635. MS and MS-MS data were processed using commercially available instrument software (Data Explorer software, version 4.9). The software parameters are described in the supplemental material.
The peak-matching algorithm involves counting the number of MS-MS fragment ions whose intensity is equal to or greater than a relative intensity threshold (e.g., 2%). The algorithm then counts the number of in silico fragment ions whose m/z fall within a specified m/z tolerance (e.g., ±2.5 thomson [Th]) of the m/z of MS-MS fragment ions; i.e., it counts the number of “matches” between MS-MS fragment ions and in silico fragment ions for the two data sets. The number of “matches” is then divided by the total number of MS-MS fragment ions whose intensities are at or above the specified intensity threshold. The resulting number is then multiplied by 100% to obtain the peak-matching score, as follows: score = 100 × (number of MS-MS fragment ion peaks that “matched” in silico fragment ion peaks)/(number of MS-MS fragment ion peaks).
The peak-matching score has a theoretical range of 0 to 100%. Zero percent indicates that no matches were identified, and 100% indicates that every MS-MS fragment ion matched an in silico fragment ion for identification. A nonzero fragment ion m/z tolerance indicates that it is possible for an MS-MS fragment ion m/z to “match” the m/z of two (or more) in silico fragment ions (or vice versa). Such multiple matches are counted only once by the algorithm; otherwise, a score greater than 100% could be obtained. The highest-scoring protein or microorganism identification that is significantly higher than the second-highest-scoring protein or microorganism identification is a presumptive correct identification. “Significantly” is defined here as a relative difference between the scores of the highest-scoring identification and the second-highest-scoring identification of 15 to 20% or greater. For comparison, the more mathematically complicated P value algorithm, developed by Demirev and coworkers, was also incorporated into the USDA software. In brief, the P value algorithm calculates the probability that an identification occurred randomly. The lower the P value of an identification, the less likely that the identification occurred randomly. A confident protein or microorganism identification is one in which the P value of the “top” identification is significantly (typically several orders of magnitude) lower than the P values of the “runner-up” identifications (7).
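The scoring rule described above can be sketched in a few lines of Python. This is an illustrative reading of the description, not the USDA software itself: the function names and data layout are hypothetical, each MS-MS peak is counted at most once (which keeps the score at or below 100%), and the relative difference between the top two scores is taken relative to the top score, which is an assumption because the text does not specify the denominator.

```python
from bisect import bisect_left

def peak_matching_score(msms_peaks, in_silico_mz,
                        intensity_threshold=2.0, mz_tolerance=2.5):
    """Percentage of above-threshold MS-MS peaks that match an in silico ion.

    msms_peaks: list of (m/z, relative intensity in percent).
    in_silico_mz: m/z values of the in silico fragment ions of one protein.
    """
    # Keep only MS-MS fragment ions at or above the relative intensity threshold.
    kept = [mz for mz, rel_int in msms_peaks if rel_int >= intensity_threshold]
    if not kept:
        return 0.0
    theoretical = sorted(in_silico_mz)
    matched = 0
    for mz in kept:
        # First in silico ion at or above the lower edge of the tolerance window.
        i = bisect_left(theoretical, mz - mz_tolerance)
        if i < len(theoretical) and theoretical[i] <= mz + mz_tolerance:
            matched += 1  # counted once, even if several in silico ions fit
    return 100.0 * matched / len(kept)

def is_presumptive_identification(top_score, runner_up_score,
                                  min_relative_difference=0.15):
    """True if the top score exceeds the runner-up by the stated relative margin."""
    if top_score <= 0:
        return False
    return (top_score - runner_up_score) / top_score >= min_relative_difference

# Hypothetical peak list and in silico ions, for illustration only.
peaks = [(100.0, 50.0), (200.0, 10.0), (300.0, 1.0), (400.0, 5.0)]
score = peak_matching_score(peaks, [101.0, 102.0, 250.0, 397.6])
```

In this example the peak at m/z 300 is discarded by the 2% threshold, the peaks at m/z 100 and 400 each find an in silico ion within ±2.5 Th, and the peak at m/z 200 does not, so two of three retained peaks match.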
The peak-matching and P value algorithms are completely independent, and the results of each calculation are displayed in the software window. Software functionality allows selective operation of one of the algorithms or both algorithms. Algorithm computation time is provided by the software. In addition, the protein and microorganism identifications can be “ranked” by either the peak-matching scores or the P values.
The peak-matching and P value algorithms described above are used under the assumption that the polypeptide backbone has an equal probability of fragmenting at every residue of the protein to produce the a, b, b-18, y, y-17, and y-18 fragment ions. However, it has been shown experimentally that singly protonated (charged) protein ions are more likely to fragment at aspartic acid (D), glutamic acid (E), and proline (P) residues (7, 28, 31, 44, 46). As discussed below, each in silico fragment ion is identified by its m/z, ion type and number, and the two amino acid residues on either side of the backbone cleavage site that resulted in formation of the fragment ion. Software functionality allows comparison of the m/z of MS-MS fragment ions to the m/z of all in silico fragment ions of a particular protein (i.e., a non-residue-specific comparison). Alternatively, MS-MS fragment ions can be compared to residue-specific in silico fragment ions (e.g., D-, E-, and P-specific in silico fragment ions). Residue-specific and non-residue specific comparisons are discussed in greater detail in the supplemental material.
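A minimal Python sketch of annotated in silico fragment ions and a residue-specific filter follows. It rests on several assumptions that are not taken from the study: only b and y ions are generated (the a, b-18, y-17, and y-18 ions are omitted for brevity), approximate average residue masses are used, the class and function names are hypothetical, and an ion is treated as residue specific when either residue flanking the cleaved bond is D, E, or P. The exact rule used by the software is given in the supplemental material.

```python
from dataclasses import dataclass

# Approximate average residue masses (Da) of the 20 standard amino acids.
AVG_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.0153   # average mass of H2O (Da)
PROTON = 1.00728  # mass of a proton (Da)

@dataclass(frozen=True)
class FragmentIon:
    mz: float
    ion_type: str  # "b" or "y"
    number: int    # number of residues in the fragment
    n_side: str    # residue on the N-terminal side of the cleaved bond
    c_side: str    # residue on the C-terminal side of the cleaved bond

def b_and_y_ions(sequence):
    """Singly protonated b and y ions for every backbone cleavage site."""
    total = sum(AVG_RESIDUE_MASS[r] for r in sequence)
    prefix = 0.0
    ions = []
    for i in range(1, len(sequence)):
        prefix += AVG_RESIDUE_MASS[sequence[i - 1]]
        n_side, c_side = sequence[i - 1], sequence[i]
        ions.append(FragmentIon(prefix + PROTON, "b", i, n_side, c_side))
        ions.append(FragmentIon(total - prefix + WATER + PROTON, "y",
                                len(sequence) - i, n_side, c_side))
    return ions

def residue_specific(ions, residues=frozenset("DEP")):
    """Keep ions whose cleavage site is flanked by one of the given residues."""
    return [ion for ion in ions if ion.n_side in residues or ion.c_side in residues]

all_ions = b_and_y_ions("GADK")       # hypothetical four-residue sequence
dep_ions = residue_specific(all_ions)
```

For the toy sequence GADK there are three cleavage sites and six b and y ions; the G|A site has no flanking D, E, or P, so the residue-specific set keeps the four ions from the A|D and D|K sites.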
Figure 1 outlines the process by which in silico bacterial protein sequences (and their associated fragment ions) were obtained. A detailed description of the construction of the in silico database (as well as software and database architecture and function) is given in the supplemental material. In brief, in silico bacterial protein sequences were downloaded using the TagIdent software at the ExPASy public website (http://ca.expasy.org/tools/tagident.html) for proteins having a pI in the range from 0.00 to 14.00 and a molecular mass that was within 5 Da of that of the singly protonated protein biomarker ion after removal of its proton charge. The searches were conducted using both the UniProtKB/Swiss-Prot (versions 55.5 to 56.0) and UniProtKB/TrEMBL (versions 38.5 to 39.0) databases. Bacterial protein sequences with possible PTM (e.g., N-terminal methionine cleavage, signal peptides, etc.) were also retrieved. Multiprotein sequence FASTA files obtained from the ExPASy website were processed using a beta version (version 8.01a5) of the commercial GPMAW software (Lighthouse Data, Denmark) to generate individual text files for each protein sequence which contain the in silico fragment ions, the protein name, the amino acid sequence, the average MW of the protein, and the taxonomic classification of the bacterium. Each in silico fragment ion is identified by its m/z, ion type and number (a, b, b-18, y, y-17, and y-18), and the two amino acid residues adjacent to the polypeptide cleavage site that resulted in formation of the fragment ion. Individual in silico text files were batch uploaded to the in silico database of the USDA software.
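The mass-window criterion used to retrieve candidate sequences can be written out as a short Python sketch. In the study this selection was carried out with TagIdent; the function below is only a local illustration of the same arithmetic, and its name, the tuple layout, and the example protein entries are hypothetical.

```python
PROTON = 1.00728  # mass of a proton (Da)

def candidate_proteins(biomarker_mz, proteins, window_da=5.0):
    """Select proteins whose average MW lies within the window of the biomarker.

    biomarker_mz: m/z of the singly protonated protein biomarker ion.
    proteins: iterable of (name, average MW in Da) tuples.
    """
    # Remove the proton to obtain the neutral molecular mass of the biomarker.
    neutral_mass = biomarker_mz - PROTON
    return [(name, mw) for name, mw in proteins
            if abs(mw - neutral_mass) <= window_da]

# Made-up database entries around the biomarker ion at m/z 11138.9.
database = [("protein A", 11136.0), ("protein B", 11143.5), ("protein C", 11133.0)]
selected = candidate_proteins(11138.9, database)
```

Here the neutral mass is about 11,137.9 Da, so the hypothetical proteins A and C fall inside the ±5-Da window and protein B, which is about 5.6 Da heavier, does not.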
Figure 2 shows a typical MS spectrum of a bacterial cell lysate of C. upsaliensis strain RM3195 analyzed by MALDI-TOF-TOF MS in linear mode using the sinapinic acid matrix. Figure 3 shows a typical MS-MS spectrum of the protein biomarker ion at m/z 11138.9 shown in Fig. 2. This protein biomarker had been previously identified by bottom-up proteomics as thioredoxin (12). Prominent fragment ions are identified by their m/z, ion type and number, and amino acid residues adjacent to the site of polypeptide cleavage that resulted in the fragment ion. As the spectrum shows, many of the fragment ions are the result of polypeptide cleavage adjacent to an aspartic acid or glutamic acid residue.
The protein biomarkers of the following three strains of Campylobacter were analyzed by top-down proteomics: C. upsaliensis strain RM3195, C. lari strain RM2100, and C. coli strain RM2228. Many of the protein biomarkers had been identified previously by bottom-up proteomics techniques (9, 11, 12).
Table 1 shows the top five identifications for a protein biomarker of C. upsaliensis strain RM3195 at m/z 11138.9 (Fig. 2) analyzed by MS-MS using MALDI-TOF-TOF MS (Fig. 3; see Fig. S1 in the supplemental material) and compared to all in silico fragment ions of bacterial protein sequences having the same molecular mass as the biomarker (within 5 Da). This corresponded to 1,409 in silico bacterial protein sequences. The protein biomarker had been identified previously by bottom-up proteomics techniques as thioredoxin (12). The rankings for the top five identifications based on the USDA scores and the P value calculations are identical, and both algorithms correctly identify the protein as thioredoxin with an N-terminal methionine cleavage PTM and its source microorganism as C. upsaliensis strain RM3195. The computation time of the USDA peak-matching algorithm is ~35% shorter than the computation time of the P value calculation. Table 1 also shows the top five identifications for an analysis that was the same as that described above except that only D-, E-, and P-specific in silico fragment ions were used for comparison. The numbers of in silico bacterial protein sequences are identical. The top identification for both algorithms correctly identifies the protein and its source microorganism. There also is significant relative enhancement of the top identification score compared to the “runner-up” identification scores when a comparison of D-, E-, and P-specific in silico fragment ions is used instead of a comparison of all in silico fragment ions (Table 1). In addition, the “runner-up” identification is different for the non-residue-specific comparison (all in silico fragment ions) than for the residue-specific comparison. The computation time of both algorithms for the residue-specific analysis is lower than that for the non-residue-specific analysis.
The USDA peak-matching algorithm is ~116% faster than the P value calculation for the D-, E-, and P-specific in silico fragment ion comparison. Fragment ion error analysis for this protein biomarker identification is described and discussed in the supplemental material.
Table 2 shows the top five identifications for a protein biomarker of C. upsaliensis strain RM3195 at m/z 12855.3 (Fig. 2) analyzed by MS-MS using MALDI-TOF-TOF MS and compared to all in silico fragment ions of bacterial protein sequences having the same molecular mass (within 5 Da) as the biomarker. This corresponded to 1,315 in silico protein sequences. This protein biomarker had been identified previously by bottom-up proteomics techniques as the 50S L7/L12 ribosomal protein (12). The top identification of both algorithms correctly identifies the protein as ribosomal protein 50S L7/L12 (with an N-terminal methionine cleavage PTM) and its source microorganism as C. upsaliensis strain RM3195. The “runner-up” identification is the 50S L7/L12 ribosomal protein of C. coli strain RM2228, whose molecular mass differs from that of the ribosomal protein of C. upsaliensis strain RM3195 by only ~2 Da. The primary amino acid sequences of the C. upsaliensis strain RM3195 and C. coli strain RM2228 50S L7/L12 ribosomal proteins are shown in Fig. 4. The homologies of these two sequences with respect to aspartic acid, glutamic acid, and proline residues are identical, which results in fragmentation channels with high levels of similarity (13, 14). However, variations in non-D, non-E, non-P amino acid residues between these sequences result in “shifts” in the m/z of some in silico fragment ions, allowing differentiation of these two proteins in a comparison of MS-MS data to in silico data. It is also interesting that the 50S L7/L12 protein of Prosthecochloris aestuarii strain DSM 271 appeared among the incorrect identifications (ranked fourth by the USDA score and fifth by the P value), which highlights the fact that, although these high-copy-number housekeeping proteins have nearly identical MWs, their amino acid sequences are sufficiently different to allow protein and source identification by MS-MS analysis.
Table 2 also shows the top five identifications for an analysis that was the same as that described above except that only D-, E-, and P-specific in silico fragment ions were used for comparison. The top identification for both algorithms correctly identifies the protein and its source microorganism. Again, there is enhancement of the top (and correct) identification compared to the “runner-up” identification when a comparison of D-, E-, and P-specific in silico fragment ions is used instead of a comparison of all in silico fragment ions. The computation time of both algorithms is lower for a residue-specific in silico comparison than for a non-residue-specific in silico comparison. However, the analysis time for the USDA peak-matching algorithm was reduced by ~60%, whereas the analysis time for the P value calculation was reduced by only ~20%.
Identifications of other protein biomarkers of C. upsaliensis strain RM3195 are shown in the supplemental material.
Table 3 shows the top five identifications for a protein biomarker of C. lari strain RM2100 observed at m/z 11253.3 obtained by MALDI-TOF MS and analyzed by MS-MS using MALDI-TOF-TOF MS. The MS-MS fragment ions were compared to all in silico fragment ions of bacterial protein sequences having the same molecular mass (within 5 Da) as the biomarker. This corresponded to 1,548 in silico bacterial protein sequences. The protein biomarker had been identified previously by bottom-up proteomics techniques as thioredoxin (11). The top identification of both algorithms correctly identifies the protein and its source microorganism. Table 3 also shows the results of an analysis that was the same as that described above except that only D-, E-, and P-specific in silico fragment ions were compared to MS-MS fragment ions. Again, the top identification of both algorithms correctly identifies the protein and its source microorganism. There is also enhancement of the top identification score compared to the “runner-up” scores when the residue-specific analysis results are compared to the non-residue-specific analysis results. The peak-matching algorithm is ~40% and ~70% faster than the P value calculation for the comparisons of all in silico ions and residue-specific ions, respectively. In addition, the computation time for the peak-matching algorithm is cut in half for the residue-specific comparison, whereas the P value computation time is slightly increased compared to the non-residue-specific analysis time.
Identifications of other protein biomarkers of C. lari strain RM2100 are shown in the supplemental material.
Table 4 shows the top five identifications for a protein biomarker of C. coli strain RM2228 at m/z 8571.4 obtained by MALDI-TOF MS and analyzed by MS-MS using MALDI-TOF-TOF MS. The MS-MS fragment ions were compared to all in silico fragment ions of bacterial protein sequences having the same molecular mass (within 5 Da) as the biomarker. This corresponded to 1,425 in silico bacterial protein sequences. This protein biomarker had been identified previously as the DUF-465 protein in another strain of C. coli by bottom-up proteomics techniques (11). The top identification of both algorithms correctly identifies the protein and its source microorganism. Table 4 also shows the results of an analysis that was the same as that described above except that only D-, E-, and P-specific in silico fragment ions were compared to MS-MS fragment ions. Again, the top identification of both algorithms correctly identifies the protein and source microorganism. In addition, there is enhancement of the USDA score and P value of the top identification compared to the “runner-up” identification when the residue-specific analysis results are compared to the non-residue-specific analysis results.
Identifications of other protein biomarkers of C. coli strain RM2228 are shown in the supplemental material.
The algorithms and software were tested using MS-MS data whose quality was variable. This reflected, in part, a gradual increase in our skill at acquiring MS-MS data for intact proteins using the MALDI-TOF-TOF instrument (an application for which this instrument was not originally designed). Our initial MS-MS data were not as good as the data collected in our later MS-MS experiments. However, it seemed useful to test the software with both high-quality and lower-quality MS-MS data. This approach was facilitated by the fact that most of the protein biomarkers identified by MS-MS had been identified previously by bottom-up proteomics (9, 11, 12) and so provided an excellent data set to test the limits of algorithm and software identification. Table 5 summarizes the quality of MS-MS spectra analyzed in this study. Two criteria were used to evaluate the MS-MS spectra qualitatively: (i) the number of prominent fragment ion peaks observed in the MS-MS spectrum (which is proportional to the fragmentation efficiency of the protein) and (ii) the noise background of the MS-MS spectrum. Typically, a higher-intensity-threshold cutoff was used for MS-MS spectra that exhibited a noisier baseline. The noise background was not necessarily uniform over the entire m/z range, which contributed to the problem of selecting the optimum intensity threshold to apply over the entire spectrum. Although increasing the intensity threshold cutoff can reduce chemical noise contributions from a noisy baseline, it may also eliminate genuine low-intensity fragment ions that are prominent in a less noisy part of the MS-MS spectrum. Not surprisingly, higher-quality MS-MS data resulted in higher-scoring correct identifications, whereas lower-quality MS-MS data resulted in lower-scoring correct identifications (or incorrect identifications).
For the lower-quality MS-MS data, in some cases it was necessary to restrict the in silico comparison to residue-specific in silico fragment ions (e.g., D-, E-, and P-specific or D-specific in silico fragment ions) in order to obtain a top-scoring correct identification. Presumably, a non-residue-specific comparison (i.e., a comparison of all in silico ions) is likely to have an increased probability of random in silico matches to chemical noise peaks, which may contribute to the greater difficulty of correctly identifying the protein from poorer-quality MS-MS data. By narrowing the in silico comparison to only the in silico fragment ions that have the highest probability for formation (D-, E-, and P-specific or D-specific in silico fragment ions), many random (false) in silico matches are eliminated, resulting in a more prominent score for the correct identification.
It should be noted that MS-MS fragment ion intensity per se is not used (by either algorithm) as a criterion for comparing MS-MS fragment ions to in silico fragment ions. Only m/z are compared. However, a minimum intensity threshold is applied to the relative intensities of MS-MS fragment ions. This intensity threshold was determined ad hoc based on the amount of baseline noise of the MS-MS spectrum after processing (but prior to centroiding). Although the absolute (or relative) intensities of fragment ions are not directly involved in algorithm calculations, one would expect that a “correct” identification by an algorithm should match a greater number of prominent fragment ions than the top incorrect identification by the algorithm. This is shown in Fig. S9 in the supplemental material. MS-MS fragment ion peaks whose relative intensities only slightly exceed the intensity threshold may or may not be caused by chemical noise. “Matches” of in silico fragment ions to the lowest-intensity MS-MS fragment ions are less significant from an MS standpoint than matches to other more prominent fragment ions. However, neither algorithm discriminates on the basis of fragment ion intensity as long as the ion peak intensity is above the preset threshold.
In addition to the problem of random in silico matches to chemical noise peaks, fragment ions from multiple protein biomarkers can increase the difficulty of identifying individual protein biomarkers. As noted previously, the nearly identical MWs of the 10,000-MW chaperonin (average MW, 9,617.3) and cytochrome c (average MW, 9,617.0) of C. lari strain RM2100 mean that it is not possible to isolate these ions on the basis of m/z; i.e., fragment ions from both proteins are detected (13, 14). Consequently, fragment ions from cytochrome c probably contributed to the difficulty of identifying the 10-kDa chaperonin using a comparison of all in silico fragment ions (see Table S5A in the supplemental material). As mentioned previously, the protein sequence for cytochrome c was not included in the in silico database because the mature protein polypeptide is covalently linked with a heme group (MW, 616.5), making in silico identification complicated. As a result, the MS-MS fragment ions of cytochrome c could not be correctly matched to their in silico sequence; however, they could be incorrectly matched to in silico fragment ions of other protein sequences in the database (i.e., false or random matches). For this reason, a D-, E-, and P-specific in silico comparison and then a D-specific in silico comparison were used to narrow the MS-MS in silico comparison to only the in silico fragment ions that have the greatest probability for formation. This may significantly reduce the number of random matches and result in a top score correctly identifying one of the protein biomarkers (i.e., the 10-kDa chaperonin) (see Tables S5B and S5C in the supplemental material). However, although the 10-kDa chaperonin was correctly identified with the highest USDA and P value scores in a D-specific in silico comparison, the top score is still “grouped” with the “runner-up” scores (see Table S5C in the supplemental material).
This suggests the importance of using ion isolation for restricting MS-MS analysis to a single protein whenever possible.
Our analysis in the current study indicated that the simple peak-matching algorithm and the more complicated P value algorithm of Demirev and coworkers appear to perform fairly well for either a non-residue-specific in silico comparison or a D-, E-, and P-specific or D-specific in silico comparison. In 2005, Demirev and coworkers (7) reported testing their algorithm for only non-residue-specific in silico comparisons; i.e., all possible in silico fragment ions (a, b, and y ions with up to two small neutral losses [NH3 or H2O]) were compared without regard to the residues adjacent to the sites of polypeptide cleavage responsible for the in silico fragment ions formed (7). Our analysis using both the peak-matching and P value algorithms suggests that a D-, E-, and P-specific or D-specific in silico comparison can reveal a correct identification that is not always apparent from a non-residue-specific in silico comparison. This is particularly apparent in the analysis of MS-MS data whose quality is marginal (see Table S2 in the supplemental material) or of MS-MS data for fragment ions that cannot be correctly “matched” to in silico ions because of PTM of the mature protein (see Table S5 in the supplemental material).
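A residue-specific comparison amounts to filtering the in silico ion list before peak matching. The following sketch is illustrative only: the record layout, the field names, and the choice to treat a residue on either side of the cleaved bond as "adjacent" are assumptions, not details taken from the software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InSilicoIon:
    mz: float
    ion_type: str  # e.g. "a", "b", "y"
    number: int    # position index of the fragment ion
    n_side: str    # residue on the N-terminal side of the cleaved bond (assumed field)
    c_side: str    # residue on the C-terminal side of the cleaved bond (assumed field)


def residue_specific(ions, residues):
    """Keep only ions formed by cleavage adjacent to one of the given residues."""
    wanted = set(residues)
    return [ion for ion in ions if ion.n_side in wanted or ion.c_side in wanted]


# Hypothetical in silico ions.
ions = [
    InSilicoIon(1000.0, "b", 9, "D", "K"),
    InSilicoIon(1100.0, "y", 10, "A", "P"),
    InSilicoIon(1200.0, "b", 11, "G", "L"),
]
dep_ions = residue_specific(ions, "DEP")  # D-, E-, and P-specific subset
d_ions = residue_specific(ions, "D")      # D-specific subset
```

Passing the full list corresponds to a non-residue-specific comparison. The filtered lists are shorter, so fewer in silico ions are available to produce random matches.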
The relative computational efficiency of an algorithm may play an increasingly important role as the number of in silico bacterial proteins increases due to the increasing number of bacterial genomes in public and private databases. The USDA peak-matching algorithm is mathematically much simpler than the P value formula. Not surprisingly, P value calculation is computationally more intensive and thus requires more time than the USDA algorithm, and the disparity in computation time between the two algorithms becomes more apparent as the number of MS-MS fragment ions increases. In the P value formula (7), the number of MS-MS fragment ions is designated “K,” the number of “matches” is designated “k,” and the number of in silico ions is designated “n.” The unexpected increase in computation time for P value calculation for a D-, E-, and P-specific analysis (Table 3; see Table S1B in the supplemental material) compared to a non-residue-specific analysis (Table 3; see Table S1A in the supplemental material), where the values of K are 79 and 69, respectively, is probably due to the calculation of factorials and powers used in the P value formula [e.g., (K − k)!]. Although fewer in silico ions (n) are compared to MS-MS fragment ions for a D-, E-, and P-specific analysis than for a non-residue-specific analysis, the number of “matches” may also decline, resulting in an increase in computation time for calculating (K − k)!.
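The growth of the (K − k)! term can be illustrated numerically. The sketch below does not reproduce the P value formula of reference 7; it evaluates only the order of magnitude of (K − k)! for the two K values quoted above, using hypothetical match counts k.

```python
import math


def log10_factorial(n):
    """Return log10(n!) using the log-gamma function, since n! = gamma(n + 1)."""
    return math.lgamma(n + 1) / math.log(10)


def unmatched_term_log10(K, k):
    """Return log10((K - k)!) for K MS-MS fragment ions and k matches."""
    if not 0 <= k <= K:
        raise ValueError("k must satisfy 0 <= k <= K")
    return log10_factorial(K - k)


# K values from the text; the k values are hypothetical.
for K, k in [(69, 20), (79, 20), (79, 10)]:
    print(f"K={K}, k={k}: (K-k)! is about 10^{unmatched_term_log10(K, k):.1f}")
```

For a fixed K, a smaller k leaves a larger factorial to evaluate, which is consistent with the explanation offered above for the longer computation time.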
Identification of bacteria (or other microorganisms) using sequence-specific fragmentation of their protein biomarkers is dependent on the availability and accuracy of the genomic information from which the in silico protein amino acid sequences are derived. In order to test the algorithms and software, we examined genomically sequenced strains of Campylobacter whose protein biomarkers had been identified previously by bottom-up proteomics techniques. Although the software and algorithm were not specifically designed to identify unknown (nongenomically sequenced) bacterial strains, the usefulness of this technique would be enhanced if unknown bacterial strains could also be identified. The ability to identify an unknown (nongenomically sequenced) bacterial strain using this technique would be dependent on the extent of sequence homology between the unknown strain and a genomically sequenced strain. A protein sequence from an unknown strain may contain amino acid substitutions compared to the same protein sequence from a genomically sequenced strain. These substitutions may result in a protein molecular mass that is outside the range specified in the initial protein search (±5 Da) of genomic and proteomic databases. However, it may still be possible to identify such proteins by expanding the protein molecular mass range for search and retrieval (e.g., ±50 Da). This would greatly expand the number of proteins retrieved from public databases and uploaded to the in silico protein database. It would also allow possible protein identification from partial sequence homology between the protein sequence of an unknown strain and the protein sequence of a genomically sequenced strain. The likelihood of identification would depend on the number and location of the amino acid substitutions. The number of amino acid variations is dependent on the phylogenetic distance between the unknown and genomically sequenced strains (10). The more closely related the two strains are, the fewer the amino acid substitutions and the greater the probability that a protein from an unknown strain could be identified based on its sequence homology to a protein from a genomically sequenced strain (10).
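The effect of widening the mass search window can be sketched as follows. The protein names and masses are hypothetical, and the function is a simple illustration of mass-window retrieval, not the database query used by the software.

```python
def within_mass_window(candidates, observed_mass, window_da):
    """Return names of candidate proteins whose mass is within +/- window_da."""
    return [
        name
        for name, mass in candidates
        if abs(mass - observed_mass) <= window_da
    ]


# Hypothetical database entries: (name, average molecular mass in Da).
candidates = [
    ("protein_A", 9617.3),
    ("protein_B", 9640.0),  # e.g. a homolog carrying amino acid substitutions
    ("protein_C", 9700.0),
]
narrow = within_mass_window(candidates, observed_mass=9618.0, window_da=5.0)
wide = within_mass_window(candidates, observed_mass=9618.0, window_da=50.0)
```

With the ±5 Da window only protein_A is retrieved; with ±50 Da protein_B is retrieved as well. The wider window admits homologs with shifted masses at the cost of a larger in silico database to compare against.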
In the current study, the software was tested by identifying protein biomarkers from pure bacterial strains. However, the only limitation of this technique for its application in analysis of bacterial mixtures is the resolving power of the TIS, which is used to isolate specific protein ions on the basis of their m/z. Currently, the narrowest TIS “window” obtainable with the TOF-TOF instrument is ±50 Da at 10 kDa. If two protein ions (either from a single bacterial strain or from multiple strains) are separated in m/z by 50 Th (or more), then it is possible to mass isolate (resolve) these two protein ions and identify each protein from the fragment ions generated. However, if two protein ions are separated in m/z by less than 50 Th, the TIS is not able to isolate the two protein precursor ions, and fragment ions from both precursor ions may be detected (although this also depends on the fragmentation efficiency of the two protein ions). Software analysis of fragment ions from multiple precursor ions may result in “runner-up” identifications that reflect correct MS-MS-in silico matches that are different from MS-MS-in silico matches of the top identification.
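The isolation criterion can be expressed as a short sketch that groups precursor ions the TIS could not separate. It assumes a fixed 50-Th minimum separation over the whole mass range (the text quotes this figure only at 10 kDa) and chains neighboring ions into one group; the function name and the m/z values are hypothetical.

```python
def group_unresolved(precursor_mz, min_separation=50.0):
    """Group precursor m/z values that lie closer than min_separation to a neighbor.

    Each returned group holds ions that would be fragmented together; a group
    of length one is a precursor that can be isolated on its own.
    """
    groups = []
    for mz in sorted(precursor_mz):
        if groups and mz - groups[-1][-1] < min_separation:
            groups[-1].append(mz)  # too close to the previous ion
        else:
            groups.append([mz])    # separated by min_separation or more
    return groups


# Hypothetical precursor ions from a mixture (m/z in Th).
groups = group_unresolved([10275.0, 9617.3, 11000.0, 9617.0, 10300.0])
```

Here the ions at m/z 9617.0 and 9617.3 fall into one group, as do those at 10275.0 and 10300.0, so their fragment ions would appear in the same MS-MS spectrum.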
We have developed web-based software for rapid top-down proteomic identification of small proteins (and their source bacterial microorganisms) from analysis of MS-MS fragment ions of intact bacterial proteins generated using MALDI-TOF-TOF MS. A simple peak-matching algorithm was used to score and rank identifications of proteins and microorganisms by comparing MS-MS fragment ions to in silico fragment ions generated from bacterial protein sequences derived from genomic databases. The P value algorithm of Demirev and coworkers was also incorporated into the software for purposes of comparison. The algorithms and software were successfully tested with protein biomarkers of species and strains of Campylobacter that had been identified previously by bottom-up proteomics techniques. A database of in silico fragment ions was constructed for bacterial protein sequences whose calculated MWs corresponded to the m/z of a protein biomarker observed in MALDI-TOF MS spectra. In silico fragment ions were identified by m/z, type, number, and the amino acid residues adjacent to the site of polypeptide fragmentation resulting in a fragment ion. Consequently, MS-MS fragment ions could be compared to in silico fragment ions without regard to the residues adjacent to the site of fragmentation (i.e., non-residue-specific comparison), or MS-MS fragment ions could be compared only to the in silico fragment ions that were formed as a result of polypeptide fragmentation adjacent to specific residues (i.e., residue-specific comparison). A D-, E-, and P-specific or D-specific analysis often increased the separation between the top identification score (the correct identification) and the scores of the “runner-up” identifications relative to the separation obtained with a non-residue-specific analysis. In some cases, a protein biomarker was successfully identified by a residue-specific analysis when a non-residue-specific analysis failed to correctly identify the protein or its source microorganism. The success of D-, E-, and P-specific or D-specific in silico analysis for identification confirms the importance of these residues in the fragmentation of singly charged (protonated) proteins. Although the relative intensities of protein biomarker ions are not explicit criteria used in the algorithm, it is reasonable to expect that a correct identification should “match” many of the most prominent MS-MS fragment ions. This was found to be the case. Finally, fragment ion error analysis may be successfully used to confirm an algorithm identification by distinguishing systematic fragment ion error caused by drift in the TOF calibration from random error caused by random matches between MS-MS fragment ions and in silico fragment ions.
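The distinction between systematic and random fragment ion error can be illustrated with a minimal sketch. It is one possible way to summarize the errors, assumed here for illustration: the mean and standard deviation of the observed-minus-in-silico m/z differences. The matched pairs are hypothetical, and no decision threshold is implied.

```python
from statistics import mean, stdev


def fragment_error_summary(matched_pairs):
    """Return (mean, standard deviation) of observed minus in silico m/z errors.

    matched_pairs is a list of (observed_mz, in_silico_mz) tuples and needs
    at least two entries for the standard deviation to be defined.
    """
    errors = [obs - calc for obs, calc in matched_pairs]
    return mean(errors), stdev(errors)


# Hypothetical matches: errors share a sign and a similar size (systematic).
drifted = [(1000.8, 1000.0), (1500.9, 1500.0), (2001.0, 2000.0)]
# Hypothetical matches: errors scatter in sign and size (random).
scattered = [(1000.9, 1000.0), (1499.2, 1500.0), (2000.1, 2000.0)]

drift_mean, drift_sd = fragment_error_summary(drifted)
scatter_mean, scatter_sd = fragment_error_summary(scattered)
```

In this example the first set has a clear offset with little spread, the pattern expected from calibration drift, while the second set has a near-zero mean and a much larger spread, the pattern expected from random matches.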
We thank Peter Højrup at Lighthouse for modifying his existing GPMAW software and providing it to us as a beta version. We also thank Christine Hoogland at Bioinformatics Institute of Switzerland for her assistance. We thank Linda C. Whitehand for statistical discussions and Robert E. Mandrell and Al Lastovica for providing Campylobacter strains RM3195, RM2100, and RM2228.
Mention of a brand or firm does not constitute an endorsement by the U.S. Department of Agriculture over other similar brands or firms not mentioned. This article describes U.S. Government work.
Published ahead of print on 1 May 2009.
†Supplemental material for this article may be found at http://aem.asm.org/.