Motivation: Identification of post-translationally modified proteins has become one of the central issues of current proteomics. Spectral library search is a new and promising computational approach to mass spectrometry-based protein identification. However, its potential in identification of unanticipated post-translational modifications has rarely been explored. The existing spectral library search tools are designed to match the query spectrum to the reference library spectra with the same peptide mass. Thus, spectra of peptides with unanticipated modifications cannot be identified.
Results: In this article, we present an open spectral library search tool, named pMatch. It extends the existing library search algorithms in at least three aspects to support the identification of unanticipated modifications. First, the spectra in library are optimized with the full peptide sequence information to better tolerate the peptide fragmentation pattern variations caused by some modification(s). Second, a new scoring system is devised, which uses charge-dependent mass shifts for peak matching and combines a probability-based model with the general spectral dot-product for scoring. Third, a target-decoy strategy is used for false discovery rate control. To demonstrate the effectiveness of pMatch, a library search experiment was conducted on a public dataset with over 40 000 spectra in comparison with SpectraST, the most popular library search engine. Additional validations were done on four published datasets including over 150 000 spectra. The results showed that pMatch can effectively identify unanticipated modifications and significantly increase spectral identification rate.
Supplementary information: Supplementary data are available at Bioinformatics online.
Liquid chromatography coupled with tandem mass spectrometry (LC-MS/MS) is the key experimental method for large-scale protein identification. In this method, proteins are digested into peptides, which are then ionized and dissociated in a mass spectrometer. The mass-to-charge ratios (m/z) and the intensities of the resulting product ions are measured to produce MS/MS spectra. To identify the peptides and proteins, sequence database search has achieved great success in the past years, and a variety of search tools have been developed, e.g. SEQUEST (Eng et al., 1994), Mascot (Perkins et al., 1999) and pFind (Fu et al., 2004). Such an approach is implemented by comparing the similarities between the experimental spectra and the theoretical spectra predicted from peptide sequences in a database. Unfortunately, due to insufficient understanding of the factors that determine peptide fragmentation, most current search tools employ simplified fragmentation models, such as the uniform backbone dissociation model, leading to many unidentified or misidentified spectra. In recent years, with the availability of millions of confidently identified MS/MS spectra, an alternative as well as complementary approach called spectral library search has emerged. Its essential idea is to build a library of experimental reference spectra rather than theoretically predicted ones. Since this approach was first introduced to the field of protein identification by Yates et al. (1998), the last decade has witnessed a group of mass spectral library search tools, such as SpectraST (Lam et al., 2007, 2008), NIST MSPepSearch (http://peptide.nist.gov/), BiblioSpec (Frewen et al., 2006), X!Hunter (Craig et al., 2006), ProMEX (Hummel et al., 2007), HMMatch (Wu et al., 2007) and MSDash (Wu et al., 2008).
Compared to the sequence database search, the spectral library search takes advantage of the previously obtained knowledge and has three obvious merits. First, improved sensitivity. Spectral library search takes into account the fragmentation pattern individually for each experimental spectrum. It yields more discriminative match scores than does the sequence database search. Second, high search speed. Experiments show that in shotgun proteomics some peptides are detected all the time while some are never (Lam et al., 2008). Thus a well-organized spectral library consisting of empirically observed experimental spectra permits a smaller and more accurate search space. Third, convenient identification of extraordinary spectra, such as those produced from peptides with unusual post-translational modifications (PTMs). These spectra are big challenges to sequence database search engines, but could be identified as easily as the ordinary ones by spectral library search (Craig et al., 2006; Hummel et al., 2007; Lam et al., 2007; Wu et al., 2007). Apparently, the above merits are based on reliable and comprehensive spectral libraries. One of the main obstacles is library coverage (Lam et al., 2007; Yates et al., 1998). Many efforts have been made on library constructions, such as NIST (http://peptide.nist.gov/) and PeptideAtlas (http://www.peptideatlas.org/speclib/). However, it remains difficult considering that PTMs may generate substantial modified forms of a peptide. Note that there have been hundreds of known modifications (e.g. 512 entries recorded in the RESID modification database by February 26, 2010) and only a few of them, e.g. phosphorylation, were extensively studied in the past.
In fact, PTM mapping has become the central issue of current proteomics. The conventional sequence database search approach meets inevitable difficulties in PTM-centric data analysis, since the PTM types have to be explicitly specified by users. In this case, not only are some possible unanticipated PTMs missed, but also the number of the PTMs considered has to be restricted to avoid combinatorial explosion of theoretical peptides in all possible modified forms. To solve these problems, the mode of open search has been proposed, in which the peptide precursor ion mass tolerance is largely expanded and one or more modification masses are inferred to compensate for the peptide mass difference (Chen et al., 2009; Tsur et al., 2005). Such an approach does not require specifying PTM types and is able to identify spectra from peptides with unanticipated PTMs, though it still has some defects to overcome (e.g. low search speed). Also, Bandeira et al. (2007) developed a database-independent algorithm, named Spectral-Networks, to detect spectral pairs produced from modified and unmodified versions of the same peptide and identify the unanticipated modifications by propagating spectral annotations in the networks of related spectral pairs. However, the potential of applying the same idea to the spectral library search had not been explored until very recently. Ahrne et al. (2009) proposed a workflow to combine open library search with sequence database search to increase spectral identification rate, but the library search engine they used was not deliberately designed for the open search mode. Besides, a spectral matching algorithm Bonanza is sometimes considered as an open library search tool (Falkner et al., 2008; Menschaert et al., 2009), but it was actually devised in a clustering framework and it is unknown whether the methods in it are directly applicable to general library search, such as the method for false discovery rate (FDR) control.
There are three key issues that have to be addressed when designing an open library search tool. The first one is the shifted m/z values of the product ions carrying PTMs. One solution to this issue lies in the proper use of precursor ion mass differences between the spectral pairs to be matched; that is, the mass differences should be considered as the potential PTM masses, as done by some open sequence database search engines, e.g. PTMap (Chen et al., 2009). However, none of the current library search algorithms has considered it. Although Bonanza does allow a mass shift equal to the mass difference when matching product ion peaks, the mass shift value is roughly determined without considering the charge states of product ions. The second issue is how to use the sequence information behind library spectra. Although some of the current library search algorithms have tried some ways to use the sequence information by annotating the explained peaks in library spectra, they do not make the best of it, especially for scoring. Usually, only a proportion of the theoretical product ions are observed in an experimental spectrum. However, the omitted proportion may also be valuable, in particular for the open search where the changes of peptide fragmentation patterns caused by some unanticipated PTM(s) should be considered. The third issue is FDR control of search results. The FDR control methods used in current library search engines are not as mature as those used in sequence database search, e.g. the widely adopted target-decoy database search strategy (Elias and Gygi, 2007).
In this article, we present a dedicated open spectral library search tool, named pMatch, to identify unanticipated PTMs from MS/MS data. It is the first time, to our knowledge, that the issues mentioned above are comprehensively addressed. First, the library is constructed with spectra optimized by the full peptide sequence information to better tolerate the peptide fragmentation pattern variations caused by some PTMs. Second, a new scoring system is devised, which uses charge-dependent mass shifts for peak matching and combines a probability-based model with the general spectral dot-product for scoring. Third, a target-decoy strategy is used for FDR control. To demonstrate the effectiveness of pMatch, a library search experiment was conducted on a public dataset of standard proteins with over 40 000 spectra. Since no open library search tool is currently available, comparison was made with SpectraST, the most popular library search engine. As expected, pMatch significantly outperformed SpectraST in detecting unanticipated PTMs and increasing the number of identified spectra. Additional validations were done on four published datasets including over 150 000 spectra; a variety of PTMs were found and the spectral identification rates were increased to a large extent.
As an integrated library search engine, pMatch supports an entire workflow including library construction, spectral matching and result evaluation.
pMatch enrolls the identified raw spectra and makes full use of their corresponding sequence information to construct the library of ‘optimized’ consensus spectra.
At the beginning, consensus spectra are generated from duplicate spectra for redundancy removal. Here, the credibly identified raw spectra with the same peptide sequence, charge and modification states are assumed as duplicate spectra. To produce a consensus spectrum, the peaks from each raw spectrum have their intensities normalized such that the top intensity value is one. The common peaks (peaks from different spectra but with small differences in m/z according to the instrument precision, e.g. ±0.5 Th for ion trap) in duplicate spectra are combined into a consensus peak, with the averaged m/z and intensity values. Only those consensus peaks occurring in the majority of the duplicate spectra are retained. All the peak intensities are then rescaled by taking their square roots. This strategy has been demonstrated to lead to better performance in spectral similarity comparison (Liu et al., 2007; Stein and Scott, 1994).
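The consensus-building steps above can be illustrated with a minimal Python sketch. The function name `build_consensus`, the greedy grouping of nearby peaks, and the reading of "majority" as a strict majority are assumptions made for illustration, not details taken from pMatch itself.

```python
import math
from typing import List, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)

def build_consensus(spectra: List[List[Peak]], tol: float = 0.5) -> List[Peak]:
    """Merge duplicate spectra into one consensus spectrum (illustrative)."""
    # Normalize each raw spectrum so that its top intensity is one.
    tagged = []
    for idx, spec in enumerate(spectra):
        top = max(i for _, i in spec)
        tagged.extend((mz, i / top, idx) for mz, i in spec)
    tagged.sort()

    # Greedy grouping (an assumption): a peak joins the current group
    # while it lies within `tol` of the group's first peak.
    groups, current = [], []
    for peak in tagged:
        if current and peak[0] - current[0][0] > tol:
            groups.append(current)
            current = []
        current.append(peak)
    if current:
        groups.append(current)

    consensus = []
    for g in groups:
        sources = {idx for _, _, idx in g}
        # Keep only peaks seen in a strict majority of the duplicates.
        if len(sources) * 2 > len(spectra):
            mz = sum(p[0] for p in g) / len(g)
            inten = sum(p[1] for p in g) / len(g)
            consensus.append((mz, math.sqrt(inten)))  # square-root rescaling
    return consensus
```

With three duplicate spectra, a peak present in only one of them is dropped, while peaks shared by two or three are averaged and kept.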
Next, consensus spectra are optimized by incorporating the peptide sequence information to make theoretical peaks ‘bud’ (including the unobserved ones). As shown in Figure 1, for each consensus (experimental) spectrum, a theoretical spectrum is generated with theoretical ion peaks (the b/y series product ions for collision-induced dissociation (CID) in this study) in the observed m/z range, each with a uniform intensity of one. In each consensus spectrum, peak intensities are normalized so that the top intensity value is one. Then, the intensities of the peaks in the theoretical and consensus spectra are multiplied by the factors θ (0 ≤ θ ≤ 1) and 1 − θ, respectively, and the two spectra are merged by superimposing their common peaks. Thus, the optimized consensus spectra are generated, with each explained peak annotated by its ion type, fragmentation position and charge state. This ‘budding’ strategy regains a part of the sequence information that was lost in the experimental spectra. The optimized spectra are thus both theoretical and experimental in nature and are expected to tolerate the variations in peptide fragmentation patterns introduced by some PTMs.
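As a rough illustration of the ‘budding’ step, the sketch below merges a consensus spectrum with a list of theoretical m/z values using the weights θ and 1 − θ. The function name `bud`, the tolerance used to decide that two peaks are "common", and the choice of the nearest experimental peak are assumptions of this sketch.

```python
from typing import List, Tuple

def bud(consensus: List[Tuple[float, float]],
        theoretical_mz: List[float],
        theta: float = 0.2,
        tol: float = 0.5) -> List[Tuple[float, float, bool]]:
    """Return (m/z, intensity, explained) peaks of an optimized spectrum."""
    top = max(i for _, i in consensus)
    # Experimental peaks: normalized to a top value of one, weighted by 1 - theta.
    merged = [[mz, (1 - theta) * i / top, False] for mz, i in consensus]
    experimental = merged[:len(consensus)]
    for t_mz in theoretical_mz:
        near = [p for p in experimental if abs(p[0] - t_mz) <= tol]
        if near:
            # Common peak: superimpose the theoretical intensity (1 * theta).
            hit = min(near, key=lambda p: abs(p[0] - t_mz))
            if not hit[2]:
                hit[1] += theta
                hit[2] = True
        else:
            # Unobserved theoretical peak "buds" with intensity theta.
            merged.append([t_mz, theta, True])
    return sorted((mz, i, e) for mz, i, e in merged)
```

Setting `theta=0` leaves the experimental intensities unchanged and gives unobserved theoretical peaks zero intensity, which matches the idea of switching the budding off.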
The last step is to generate a set of decoy spectra equal in number to the optimized consensus spectra, since pMatch uses a target-decoy strategy to evaluate its search results. The details of the decoy spectrum generation scheme are described later in this article.
Given a query spectrum, those library spectra with their precursor ion mass differences within a user-set tolerance and with the same charge state are selected as candidates for comparison. The precursor ion mass tolerance may be very large for the open search, e.g. ±300 Da. Finally, the candidate spectrum with the highest match score is assigned as the identification result of the query one.
Before matching, each query spectrum undergoes a simple preprocessing procedure. Isotopic peaks are removed and the peak intensities are rescaled by taking their square roots. At most the top 6 peaks per 100 Th are retained for later matching.
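The rescaling and peak-picking part of this preprocessing might look like the following sketch. Isotopic peak removal is omitted, and the use of fixed, non-overlapping 100 Th windows is an assumption, since the text does not say how the windows are placed.

```python
import math
from collections import defaultdict

def preprocess(peaks, top_n=6, window=100.0):
    """Square-root rescale and keep at most top_n peaks per m/z window."""
    bins = defaultdict(list)
    for mz, inten in peaks:
        # Fixed windows [0, 100), [100, 200), ... are an assumption.
        bins[int(mz // window)].append((mz, math.sqrt(inten)))
    kept = []
    for members in bins.values():
        members.sort(key=lambda p: p[1], reverse=True)
        kept.extend(members[:top_n])
    return sorted(kept)
```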
To determine peak hits when matching two spectra, the precursor ion mass difference (which we call ΔM in the following parts of this article) is used to compute the allowed mass shifts for peak matching. Since the charge states of the explained peaks in library spectra are already known, the mass shifts could be accurately determined. The specific rules to find out peak hits are exhibited as follows. Peaks from the query spectrum are examined in the descending order of their intensities. If the query peak being examined has its m/z value mQ, and the user-set product ion m/z tolerance is Tp, then two sets of library peaks are selected:
The peaks from either S1 or S2 are chosen as candidate peaks if the ΔM is big enough to cause a PTM (say beyond ±0.5 Da); otherwise, only peaks from S1 are chosen. The most intensive candidate peak is finally determined as the hit peak to the query peak. Each peak can only be hit at most once.
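The matching rules can be sketched as follows. Because the definitions of S1 and S2 are not reproduced in this text, the sketch assumes that S1 holds library peaks within ±Tp of mQ, and that S2 holds explained library peaks that fall within ±Tp of mQ after being shifted by ΔM divided by the fragment charge. The sign convention (ΔM = query mass minus library mass) and all names are likewise assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class LibPeak:
    mz: float
    intensity: float
    charge: Optional[int] = None  # None for unexplained peaks

def match_peaks(query: List[Tuple[float, float]], library: List[LibPeak],
                delta_m: float, tp: float = 0.5, min_shift: float = 0.5):
    """Return (query_index, library_index) peak hits."""
    shifted = abs(delta_m) > min_shift  # is delta_m large enough to be a PTM?
    used, hits = set(), []
    # Examine query peaks in descending order of intensity.
    order = sorted(range(len(query)), key=lambda i: query[i][1], reverse=True)
    for qi in order:
        mq = query[qi][0]
        candidates = []
        for li, lp in enumerate(library):
            if li in used:  # each library peak can be hit at most once
                continue
            in_s1 = abs(lp.mz - mq) <= tp
            in_s2 = (shifted and lp.charge is not None
                     and abs(lp.mz + delta_m / lp.charge - mq) <= tp)
            if in_s1 or in_s2:
                candidates.append(li)
        if candidates:
            # The most intense candidate becomes the hit peak.
            best = max(candidates, key=lambda li: library[li].intensity)
            used.add(best)
            hits.append((qi, best))
    return hits
```

For a doubly charged fragment and ΔM = 80 Da, the allowed shift is 40 Th, which is the charge dependence the text emphasizes.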
As for spectral similarity scoring, pMatch employs two sub-scores: a spectral dot-product score and a probability-based score.
The spectral dot-product score (SDP_Score) is calculated as:
where IL and IQ denote the intensities of the library peak and the query peak, respectively.
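For orientation, a commonly used normalized spectral dot-product is sketched below. The exact normalization used by pMatch is not given in this text, so the denominator here is an assumption, as are the function and argument names.

```python
import math

def sdp_score(hits, lib_int, query_int):
    """Normalized dot-product over matched peak pairs.

    hits: list of (query_index, library_index) pairs.
    lib_int, query_int: intensities of all library and query peaks.
    """
    num = sum(lib_int[l] * query_int[q] for q, l in hits)
    den = math.sqrt(sum(i * i for i in lib_int) * sum(i * i for i in query_int))
    return num / den if den else 0.0
```

Under this normalization, two identical spectra with every peak matched score 1, and a pair with no peak hits scores 0.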
For a query spectrum, there are usually several candidate library spectra (here we let the number be W). To determine whether one match ‘stands out’ from the remaining candidates, we use a probability-based score. A peak in a query spectrum is defined as a capital peak if its intensity is no less than 5% of the most intensive peak and is ranked in the top 10 in this spectrum. A hit between a capital peak and an explained library peak is called a mighty hit. Let n be the number of the capital peaks in the query spectrum, ki be the number of mighty hits in the match between the query and the i-th candidate spectrum, and mi be the number of explained peaks of the i-th candidate (the value of mi is doubled if mass shifts are triggered in the i-th match). Then the global average probability (p) that a capital query peak and an explained library peak make a peak hit can be calculated as follows:
For each capital peak in the query spectrum, the probability (P) that a mighty hit occurs by chance between it and one of the explained peaks in the i-th candidate library spectrum is:
The probability (P_value) that ki or more mighty hits occur by chance between the query and the i-th candidate library spectrum is:
The probability-based score, denoted by P_Score, is then calculated according to Equation (5). It evaluates the significance of a certain match on the basis of the statistic background of all candidate matches.
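The capital-peak definition and the chance probability of observing ki or more mighty hits can be sketched as below. The binomial form assumes that the n capital peaks behave as independent trials with the same per-peak hit probability; this is a natural reading of the description, but the original equations are not reproduced here, and the function names are illustrative.

```python
from math import comb

def capital_peaks(peaks, rel=0.05, top=10):
    """Peaks ranked in the top `top` by intensity and at least `rel` of the base peak."""
    top_int = max(i for _, i in peaks)
    ranked = sorted(peaks, key=lambda p: p[1], reverse=True)[:top]
    return [p for p in ranked if p[1] >= rel * top_int]

def tail_probability(n, k, p_hit):
    """P(X >= k) for X ~ Binomial(n, p_hit): k or more chance hits among n peaks."""
    return sum(comb(n, j) * p_hit**j * (1 - p_hit)**(n - j)
               for j in range(k, n + 1))
```

A small tail probability means that the observed number of mighty hits is unlikely to arise by chance, so the match stands out from the other candidates.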
The final score of a match between a library spectrum and the query spectrum, which we call ‘pMatch_Score’, is the product of SDP_Score and P_Score:
\[ \mathrm{pMatch\_Score} = \mathrm{SDP\_Score} \times \mathrm{P\_Score} \]
After the library spectrum with the highest pMatch_Score is found, the location of the PTM on the peptide is assigned as follows. Each amino acid residue is assumed as the PTM site and a theoretical spectrum is predicted from the peptide with the PTM-containing product ion peaks shifted accordingly. Then this series of theoretical spectra are scored against the query spectrum using the common spectral dot-product. The highest scored site is accepted as the PTM location.
If a large set of query spectra is searched, then the control of FDR is necessary. Since the target-decoy search strategy has been the leading way to estimate the FDR of sequence database search results, a natural idea is to extend it to the spectral library search. Yen et al. (2009) and Lam et al. (2010) have demonstrated the feasibility of using decoy spectra for FDR estimation in the spectral library search. Here, we extend this idea to the open search mode, and employ a similar approach for decoy spectra generation. For each optimized consensus spectrum in the library, a decoy spectrum is generated with the same precursor ion mass and charge state. Since the amino acid sequence is already known, a ‘pseudo-reversed’ (Elias and Gygi, 2007) sequence is made from the original peptide sequence; that is, the order of all the amino acid residues is reversed except for the C-terminal one, so that the enzyme digestion feature is preserved. Then, the corresponding decoy spectrum is generated with the explained peaks moved to the new m/z positions determined by their annotations and the pseudo-reversed sequence.
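The pseudo-reversal itself is a one-line string operation, sketched below. The function name is illustrative, and the relocation of annotated peaks to their new m/z positions is not shown.

```python
def pseudo_reverse(peptide: str) -> str:
    """Reverse all residues except the C-terminal one."""
    if len(peptide) < 2:
        return peptide
    return peptide[:-1][::-1] + peptide[-1]
```

For a tryptic peptide such as `PEPTIDEK`, the result `EDITPEPK` still ends in lysine, so the decoy keeps the tryptic C-terminus.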
pMatch filters search results by their pMatch_Scores and estimates FDR using the formula FDR = FP/TP, where FP and TP represent the numbers of matches to the decoy and original spectra, respectively. Importantly, for the open search mode, the result filtration rule can have a considerable impact on the spectral identification rate. The normal rule is to rank the whole result list by score and then calculate the estimated FDR. Because of the mass shifting strategy used in the open search, however, a pair of spectra with significant ΔM (where mass shifting works) has a higher chance of peak hits, and is thus likely to produce a higher score than a pair with insignificant ΔM. Consequently, false positive identifications with significant ΔM have a higher chance of passing a uniform score cutoff. Therefore, the more reasonable filtration rule that we advocate is to group all the results into two lists according to their ΔM values, i.e. those with insignificant ΔM and those with significant ΔM. Afterwards, the results in the two groups are ranked separately for FDR estimation. This separate filtration rule is expected to increase the spectral identification rate compared to the normal rule.
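A minimal sketch of the separate filtration rule is given below, using the FDR = FP/TP estimate from the text. The 0.5 Da boundary between insignificant and significant ΔM, the tuple layout of the matches, and the function names are assumptions of this sketch.

```python
def score_cutoff(matches, max_fdr=0.01):
    """matches: list of (score, is_decoy).

    Return the lowest score cutoff whose estimated FDR (decoys / targets)
    stays within max_fdr, or None if no cutoff qualifies.
    """
    decoys = targets = 0
    cutoff = None
    for score, is_decoy in sorted(matches, key=lambda m: m[0], reverse=True):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= max_fdr:
            cutoff = score
    return cutoff

def separate_filtration(matches, max_fdr=0.01, min_shift=0.5):
    """matches: list of (score, is_decoy, delta_m). Filter each group on its own."""
    groups = {"unshifted": [], "shifted": []}
    for score, is_decoy, delta_m in matches:
        key = "shifted" if abs(delta_m) > min_shift else "unshifted"
        groups[key].append((score, is_decoy))
    accepted = []
    for key, group in groups.items():
        cutoff = score_cutoff(group, max_fdr)
        if cutoff is not None:
            # Report only target (non-decoy) matches at or above the cutoff.
            accepted += [(s, key) for s, d in group if not d and s >= cutoff]
    return accepted
```

Because each group gets its own cutoff, high-scoring decoys among the shifted matches no longer force a stricter threshold on the unshifted ones.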
In this study, a comparison experiment between pMatch and SpectraST was carried out on a public dataset, with detailed analysis in both the conventional and the open search modes. To further validate pMatch, four additional published datasets were analyzed in the open search mode. The five datasets, including ~200 000 spectra in total, were all processed with the same experimental workflow.
The five published datasets chosen in this study were from different species. The MS spectra were derived from high- or medium-precision instruments, as the use of high-precision instruments becomes the trend of proteomics development (Mann and Kelleher, 2008), and it is practical to gain more accurate PTM masses determined by the precursor ion mass differences. The brief summaries of the datasets are given below:
The way to construct the spectral libraries is similar to that proposed by Ahrne et al. (2009). This way has been demonstrated to be very effective in increasing the spectral identification rate of a dataset. First, the spectra in a dataset are searched against a protein sequence database. Then, the credibly identified spectra are accumulated to construct a spectral library, against which the remaining spectra are afterwards searched.
To identify some of the spectra for library construction, the pFind search engine (Fu et al., 2004; Li et al., 2005; Wang et al., 2007) (version 2.3) was used to search a target-decoy sequence database including the standard, pollution and background proteins (see Supplementary Data for the detailed description of the database). During searching, the precursor ion mass tolerance was set to ±50 ppm, and the product ion m/z tolerance was ±0.5 Th. Full tryptic specificity was applied, allowing up to two missed cleavage sites. Carbamidomethylation of cysteine was specified as a fixed modification, and oxidation of methionine as a variable one. After sequence database search, we observed that most of the identified spectra with high confidence had their precursor ion mass biases of around +2 ppm. The search results were then filtered with precursor ion mass deviation from −2 to +6 ppm at 1% FDR. Additionally, only those spectra from the proteins containing at least two unique detected peptides were reserved. Finally, a total of 12 032 identified spectra, including 577 unique peptides (with distinct amino acid sequences and PTMs) and 963 unique precursor ions (with distinct sequences, PTMs and charge states), were obtained and used to construct a spectral library.
For the library search, pMatch (version 1.0) and SpectraST (version 3.1) were run with the same spectral source for library construction and searching, and both the conventional and the open searches were carried out. For SpectraST, the precursor ion m/z tolerances were set to ±2 and ±150 Th, respectively, for the conventional and the open searches. The parameter controlling the product ion m/z tolerance was 1 bin/Th (equal to ±0.5 Th). The search results were post-processed by PeptideProphet (Keller et al., 2002) for FDR estimation, as suggested by Lam et al. (2007, 2008). For pMatch, the θ in the ‘budding’ step of library construction was set to zero for the conventional search to reduce spectral distortion and to 0.2 for the open search to increase the robustness of the library. Given that the lowest charge state of the spectra in this dataset was 2+, the precursor ion mass tolerances were set to ±4 and ±300 Da, respectively, for the conventional and the open searches. The product ion m/z tolerance was ±0.5 Th. The FDRs of the search results were controlled by the target-decoy strategy with the normal filtration rule (rather than the separate filtration rule, for a fair comparison).
The number of identified spectra from both engines at different FDR cutoffs are illustrated in Figure 2. It can be seen that compared to the conventional search, the open search significantly increased the number of identified spectra for both search engines. pMatch and SpectraST comparably performed in the conventional search. When it comes to the open search, however, pMatch identified nearly twice as many spectra as SpectraST through the whole FDR range considered.
In order to explore the differences between the two engines, a careful analysis was conducted on the results under 1% FDR. In the conventional search, as shown in Figure 3a, there were 2462 spectra identified by both engines, among which 2451 had concordant matches and 11 conflicted. After manually validating the 11 spectra by taking a close-up view of their MS/MS spectra and tracing back to the corresponding MS spectra, we found that all 11 query spectra were co-eluted spectra and that the two engines had identified different components of them. Supplementary Figure S1 gives a typical example of a co-eluted spectrum. Unlike the conventional search, where the two engines showed over 80% overlap between their results, in the open search, as revealed by Figure 3b, fewer than 40% of pMatch's identifications could be found in SpectraST's results, although the 13 disagreements again all came from co-eluted spectra.
As discussed above, each identification in the open search has the precursor ion ΔM as the potential PTM mass. The histograms of the ΔM values detected by the two engines are shown in Figure 4. Some frequent ΔM values detected by pMatch were rarely or never detected by SpectraST, for example, −128 Da (lysine loss), 22 Da (sodium adduct), 38 Da (calcium adduct) and 152 Da (carbamidomethylDTT). The main reasons are likely that some modified spectra have a considerable percentage of the observed peaks with their m/z values shifted and that some special PTMs might substantially alter the fragmentation pattern of a peptide (see Supplementary Figure S2 for a spectrum with a sodium adduct and Supplementary Figure S3 for the influence of the ‘budding’ strategy on PTM detection). SpectraST, however, neither considers the mass shifts caused by unanticipated PTMs during peak matching nor makes use of the sequence information to tolerate the peptide fragmentation pattern variations. In contrast, SpectraST identified more spectra with very small absolute ΔM values (within ±5 Da), which mainly resulted from duplicate spectra, co-eluted spectra and spectra from deamidated peptides.
Then, we concentrated on the abundant ΔM values (with ≥20 spectra for either engine) and manually validated some representative spectra. Nearly all of the abundant ΔM values were explained (see Supplementary Table S1 for their frequencies and explanations). Among these ΔM values, many PTMs were found (shown in Table 1); for example, a disulfide bridge was detected (shown in Figure 5). Additionally, some ΔM values were caused by amino acid substitutions, missed cleavages or semi-digestions, while some corresponded to combinations of two or more other ΔM values. Only two ΔM values could not be explained using our current knowledge. For one of them, there was evidence that something had indeed happened to the peptides (see Supplementary Figure S4), while the other one might be a false positive.
In addition to those abundant ΔM values, low-abundance ones also provided a wealth of information. Some of them corresponded to important PTMs, such as phosphorylation. pMatch and SpectraST identified 13 and eight spectra, respectively, with a ΔM of 79.97 Da. These spectra are presumed to be derived from phosphorylated peptides. Figure 6 gives an example of such spectra.
For further validations, four additional published datasets were analyzed by pMatch in the open search mode, obeying the same workflow as above. The detailed search parameters are listed in Supplementary Table S2. To explore how much in the end pMatch could help to increase the spectral identification rate, here we used the separate filtration rule for FDR estimation. Table 1 shows the analysis results. For completeness, the result of the ISB-18mix dataset is also listed. We can see that the spectral identification rates significantly grew after library search and some interesting modifications were detected. For example, the ΔM of 12 Da detected in two datasets all occurred on peptide N-terms or basic amino acids. This modification is induced by formaldehyde (Toews et al., 2008), and has been recently detected in other datasets (Menschaert et al., 2009). Other detected PTMs include formylation (28 Da), acetylation (42 Da), methylation (14 Da), etc. Interestingly, in the Gygi-Qstar dataset, a number of spectra are identified with ΔM distributed from −20 to −3 Da. Many of them show no mass shift in product ions, compared with their matched library spectra (see Supplementary Figure S5), indicating that their precursor ion masses might have been incorrectly judged.
We have presented a novel spectral library search tool, pMatch, deliberately designed for the open search mode. Its ability to identify spectra with unanticipated PTMs was demonstrated on several datasets. In cooperation with traditional sequence database search, pMatch is able to raise the spectral identification rate to a large extent. The key factors contributing to the success of this method lie in three aspects: the consideration of accurate mass shifts for peak matching; the use of full peptide sequence information for consensus spectral optimization; and a new scoring function that combines the general intensity-based dot-product with a probabilistic model of peak matching.
The authors thank Dr Wilhelm Haas (Harvard Medical School) for providing the dataset Haas-Data in RAW format, and thank Dr Henry H.N. Lam (Department of Chemical and Biomolecular Engineering, HKUST) and Dr Stephen E. Stein (National Institute of Standards and Technology) for valuable discussions.
Funding: This study was supported by the National Natural Science Foundation of China under grant no. 30900262; the CAS Knowledge Innovation Program under grant no. KGGX1-YW-13; the National Key Basic Research and Development Program (973) of China under grant no. 2010CB912701; and the National High Technology Research and Development Program (863) of China under grant nos. 2007AA02Z315, 2008AA02Z309.
Conflict of Interest: none declared.