|Home | About | Journals | Submit | Contact Us | Français|
The analysis of the large volume of tandem mass spectrometry (MS/MS) proteomics data that is generated these days relies on automated algorithms that identify peptides from their mass spectra. An essential component of these algorithms is the scoring function used to evaluate the quality of peptide-spectrum matches (PSMs). In this paper, we present new approach to scoring of PSMs. We argue that since this problem is at its core a ranking task (especially in the case of de novo sequencing), it can be solved effectively using machine learning ranking algorithms. We developed a new discriminative boosting-based approach to scoring. Our scoring models draw upon a large set of diverse feature functions that measure different qualities of PSMs. Our method improves the performance of our de novo sequencing algorithm beyond the current state-of-the-art, and also greatly enhances the performance of database search programs. Furthermore, by increasing the efficiency of tag filtration and improving the sensitivity of PSM scoring, we make it practical to perform large-scale MS/MS analysis, such as proteogenomic search of a six-frame translation of the human genome (in which we achieve a reduction of the running time by a factor of 15 and a 60% increase in the number of identified peptides, compared to the InsPecT database search tool). Our scoring function is incorporated into PepNovo+ which is available for download or can be run online at: http://bix.ucsd.edu.
In the post-genomic era, focus has shifted towards the identification and characterization of all gene products that are expressed in a given organism.1 This large-scale analysis of proteins, dubbed Proteomics, contributes greatly to the understanding of gene function, biochemistry of proteins, processes and pathways.2 In this arena, tandem mass spectrometry (MS/MS) has emerged as the tool of choice for high-throughput identification of proteins and determination of details of their primary structures.3,4 Recent technological breakthroughs have led to a dramatic increase in the volume of MS/MS data being generated. However, analyzing this abundant data is not a trivial task. Without developing novel, more powerful algorithms, we can expect computational bottlenecks to restrict the scope of discoveries that can be made from experiments involving mass spectrometry.
MS/MS-based proteomics typically involves the enzymatic digestion of a protein sample, followed by separation and fragmentation of peptide ions in the mass spectrometer. The initial computational task performed when analyzing this data involves the identification of peptides from their mass spectra. While some algorithms perform this task by comparing query spectra to collections of identified spectra (e.g., spectrum libraries5,6) or theoretical spectra derived from a database of peptide sequences (such as the SEQUEST algorithm7), most algorithms rely on scoring functions to assess the quality of matches between a query spectrum and its possible peptide interpretations. A peptide-spectrum match (PSM) scoring function assigns a numerical value to a peptide-spectrum pair (P,S) expressing the likelihood that the fragmentation of a peptide with sequence P is recorded in the experimental mass spectrum S. The problem of scoring PSMs has received considerable interest in the community over the years.8-25 Most scoring functions create a generative statistical model for Prob(S|P), the probability that spectrum S was created from a fragmentation of peptide P. We can then chose P* = argmaxPProb(S|P) as the likeliest peptide that created S. The hope is, that if the model Prob(S|P) is sufficiently accurate, the peptide that maximizes Prob(S|P) will indeed be the true peptide whose fragmentation is recorded in S. The probability Prob(S|P) is often converted to a score by using a log ratio test that compares Prob(S|P) to the probability obtained from a simple null hypothesis, such as a model that assumes that all peaks are distributed randomly.8,13,16,17,19,22,23
The problem with using generative approaches to scoring PSMs is that peptide fragmentation is an extremely complex process, that is not easily represented with high fidelity by simple statistical models. Current scoring functions are usually sufficient when the search space is small, such as searching against a small set of protein sequences. However, when we increase the search space, by searching against a large database such as the six-frame translation of the human genome, or by performing de novo sequencing (which effectively searches the space of all peptides), the number good identifications typically decreases significantly. In these large search spaces, generative scoring models often lack the sufficient power to discriminate between the correct PSMs and the many close false ones (such as homeometric peptides that often exhibit very similar fragmentation patterns26).
It is worth mentioning that scoring functions for PSMs are used slightly differently with database searches and de novo sequencing. In de novo sequencing, we search the space of all peptides, and we assume that the correct peptide sequence is in our search space (with this statement we ignore the fact that a peptide might contain modifications since it is often too difficult to perform de novo sequencing in such cases without knowledge of the peptide's specific modification). Therefore the PSM scoring function has a simple ranking goal in this case; with each spectrum, bring the correct peptide to the top of the PSM candidate list. In contrast, when we score a PSM from a database search, we cannot assume that the correct peptide is necessarily in the database. In fact, in many cases, the database does not contain the correct peptide sequence for a query spectrum (e.g., the peptide might originate from an unknown gene or span an unknown splice boundary). Thus, a database search PSM scoring function has a more difficult ranking goal. In addition to bringing the correct PSM to top of the spectrum's candidate list, the scoring function also needs to (ideally) assign all correct PSMs a higher score than all incorrect PSMs. From a machine learning perspective, this can be viewed as classification task, in which we create a model that is trained to separate between the class of all correct PSMs and the class of incorrect ones. This task can be quite difficult because often PSM scores vary according to the quality of the spectrum.27 Thus, a correct match between a poorly fragmented peptide and spectrum A might score lower than an incorrect match between some other well fragmented peptide and spectrum B. Nonetheless, since their search space is typically small, database scoring functions usually cope with these more stringent requirements (they are rarely confronted with too many high-scoring incorrect PSMs). In addition, database scores are often post-processed,28,29 where additional factors are considered, such as the difference between the score of the top and second best PSMs (e.g., the ΔCn measure in SEQUEST's score7). Thus, in many cases, low-scoring PSMs are often accepted because the gap between the score of the first and second best PSMs is large enough. However, when the size of the database search space increases substantially, such as when searching a six-frame translation of a genome, there are many more strong but incorrect PSMs, and the database scoring function's performance deteriorates significantly. As we demonstrate below, a more powerful and discriminatory scoring function becomes essential in these circumstances.
The difference in the goals of scoring PSMs from de novo sequencing and database search has led us to choose a less conventional machine learning algorithm to train models. Instead of the more common generative probabilistic models, we take a discriminative ranking approach to scoring with the boosting-based RankBoost algorithm.30 Though we are not the first too use discriminative models for scoring PSMs (e.g., LOD scores15), this is typically not a common approach. By using discriminative algorithms, we do not attempt to train models that describe the process of peptide fragmentation (like the generative approaches), but rather, we optimize the models directly on the PSM scoring task at hand: distinguishing between correct PSMs and incorrect ones.
To train effective models we rely on a diverse set of features that capture various aspects of the quality of a PSM (e.g., the number of annotated peaks, the score of the peptide's path in a spectrum graph, etc.). We also use features that examine how well the observed spectrum fits predictions we make for the peptide's fragment peak ranks, using a novel ranking-based algorithm we developed.31 Though many of the features we use in our models are, each on their own, only slightly helpful at indicating if a PSM is correct or not, the RankBoost algorithm combines them into a powerful discriminatory scoring function for PSMs.
Much of the success of our algorithm can be attributed to the large volume of MS/MS data that is now available for training. We collected a set of ~320000 examples of unique peptide-spectrum pairs. With such a large dataset, the learning algorithm was able to create detailed models using hundreds of features without a fear of detrimental overfitting of the training data. While originally designed with de novo sequencing in mind, we show how our scoring approach can be effectively applied to database search results too.
Our method is effective for several PSM scoring tasks. When we used it to rerank the output of our de novo sequencing algorithm PepNovo,16 it significantly improved the performance over a strong de novo algorithm like Peaks.12 We also applied the score to the problem of peptide sequence tag generation, and were able to generate more accurate and longer tags, which improved the performance of a database search (making InsPecT17 runs approximately 15 times faster). We created specific scoring models for rescoring database search results, training them to behave more like a classifiers and not just pure rankers of PSMs. When we applied these models to post-process the output of InsPecT, it significantly increased the number of peptide identifications (a 20% increase when searching against a 30 million amino acid database). The improvement was much more significant when we applied our score to the results of a search against a six-frame translation of the human genome (~3 billion amino acids), where the number of identified peptides increased by 60%. This combination of both a significantly faster, and at the same time, much more sensitive database search, makes it practical to perform large scale proteogenomic analysis, even with large eukaryotic genomes.
Our experiments with scoring used a large set of ~ 320000 unique peptide-spectrums pairs collected from various MS/MS experiments using low-resolution CID ion-trap mass spectrometers. Most of the data we used was generated in the Briggs lab at UCSD (samples of from human HEK293 cell culture32 and samples from Dictyostelium discoideum33), and the Smith lab at PNNL (samples taken from Shewanella oneidensis MR-134,35).
We relied on the InsPecT database search tool to perform peptide identifications (release 20070613),17 using the default search parameters (precursor mass tolerance 2.5 Da, fragment ion tolerance 0.5 Da). All searches were performed using a shuffled decoy database.36,37 The InsPecT F-score threshold values for accepting identifications were selected to ensure a true positive peptide identification rate of 98% (i.e., only 2% of the peptide hits came from the decoy database). Since a peptide's charge and precursor mass can greatly influence the nature of its observed spectrum, we partitioned the training data into distinct sets as described in Table 1, and trained separate models for each partition.
Machine learning deals with algorithms that enable a computer to “learn” from data (inductive learning). A common machine learning task is classification, where a learning algorithm is used to derive a model for assigning a class label to each instance x from a domain space of instances (e.g., the Perceptron algorithm,38 Naïve Bayesian classifiers,39 Support Vector Machines40). However, classification algorithms are not always the most suitable framework; some problems have an inherent structure that suggests using other frameworks. For instance, a query to an internet search engine may return many webpages as answers. Usually, one cannot state that an answer to a query is completely right or completely wrong, rather a common approach is to assign a degree of relevance to each returned webpage. In such cases we can use a ranking algorithm which scores the answers on a gradient presenting the most relevant answers first (when the ranking algorithm is used to refine a previous ordering it is also called reranking). As we see below, ranking is also useful approach to solving problems involving PSM scoring.
We used the RankBoost algorithm of Freund et al.30 to train our ranking-based models. The RankBoost algorithm uses a machine learning method called boosting,41,42 which produces highly accurate prediction rules by combining many “weak” rules that, each on their own, may be only moderately accurate.
To train models with RankBoost we need to supply the algorithm with several inputs:
The goal of the RankBoost learning algorithm is to create a scoring function H : → that weights and combines feature functions from F, in order to induce a ranking that misorders as few pairs as possible in Φ. In our case, this means that the ranking function H is optimized to give correct PSMs a higher score than incorrect ones (when confronted with pairs of PSMs that involve the same spectrum S).
One of the benefits of the RankBoost algorithm is that it only incorporates into its models features that are useful (i.e., they help it make fewer ordering mistakes on the training data). RankBoost also has a high tolerance to uninformative features; we can supply it with a large pool of candidate features that can be correlated, and some might not even be relevant to the objective we are trying to achieve, nevertheless, in the training process, the algorithm performs its own internal “feature selection”, and incorporates only features that advance the goal.30 This property makes it easy for us to design a single set of feature functions (“one size fits all”) that incorporates all features that might be useful for peptide identification, without needing to consider whether the features are important for a specific model. For example, a feature that looks at the number of y+2 annotations might be very important for a model for scoring large triply-charged peptides, but only represent noise in a model that scores singly-charged peptides. With RankBoost we do not have to evaluate each model and decide which of the possible features are relevant to it, the algorithm does that automatically for us.
The most important component in our scoring models are the feature functions they use. Our models draw on a diverse set of features, created using domain knowledge, that each in their own way reflect different characteristics that can help distinguish between correct and incorrect PSMs. In total, the models can contain up to 225 features, though as explained above, not all get selected in each model. We grouped these features into different classes, as described below. A more detailed description of the feature functions can be found in the online supplemental material.
There are many chemical pathways that participate in peptide fragmentation and lead to the generation of spectra with very diverse characteristics. Better understanding of peptide fragmentation process is needed for more accurate and sensitive peptide sequencing algorithms.43,44 One way in which knowledge of peptide fragmentation “rules” is used to improve PSM scoring is by predicting theoretical spectra for candidate peptides, and comparing them to the observed experimental spectrum (e.g., checking that the two spectra share the same strong peaks, or high value for the dot-product between them). We note that unlike the method described in the SEQUEST algorithm,7 the theoretical spectra predicted with the aid of the additional chemical knowledge do not treat all b- and y-ions equally, but rather try to determine which specific peaks should be strongest (e.g., peaks that are N-terminal to proline). This gives more discriminatory power since the peaks in the experimental spectrum must not only appear in the correct mass locations, but they also need to have the correct “shape” (i.e., specific peaks need to have a stronger intensity). If the models governing the generation of the theoretical spectra are accurate, then there should be a high resemblance between the theoretical and observed spectra. Using this approach has led to improved peptide identification rates in database searches.24,45
Since predicting accurate MS/MS spectra is a difficult task,46 we chose to solve a slightly simpler problem, predicting a ranks of a peptide's fragment ions. This problem is formally described in Figure 1. For more details on our algorithm to solve this problem see ref.31
Once we have a prediction for the ranks of a peptide's fragment ions, we compare them with the ranks observed for the fragments in the experimental spectrum using several simple feature functions. For instance, such a function might report the observed rank (in the experimental spectrum) of a fragment predicted to be strongest according to our model, others may report the observed ranks of the fragments predicted to be second-strongest, third-strongest, etc. (the features focus mostly on the strongest peaks since those are predicted most accurately). We also created features that report if there are noticeable gaps in our rank predictions. For example, a feature can report how many of the peaks that are predicted to be strong, are actually missing in the experimental spectrum.
The space of all peptides is extremely large, making it inappropriate for an exhaustive case-by-case analysis. Nonetheless, most de novo algorithms are able to consider all likely peptides by modeling the search space for a query spectrum as a spectrum graph.8,47 A spectrum graph is a directed acyclic graph; its vertices correspond to putative prefix masses (cleavage sites) of the peptide. Two vertices are connected by a directed edge from the vertex with the lower mass to the one with a higher mass if the difference between them equals the mass of an amino acid. We use PepNovo's scoring function to score nodes in the graph.16 For each sequence peptide P we can assign a path in the spectrum graph; the score of the path is indicative of how likely it is that the observed spectrum was created from the fragmentation of P.
The spectrum graph feature functions examine several aspects that measure the quality of P's path in the graph. They report the path score, average path score (normalized according to the peptide length), the lowest score of a node in the graph (which can be indicative of a sequencing error), the rank of path in the graph, etc.
The spectrum graph scores evaluate combinations of fragments that involve specific cleavage sites. It is also beneficial to take a global look at how well the peptide explains the spectrum's peaks, like in the case of the aforementioned peak rank prediction features. With the peak annotation features, we look at more basic statistics that examine the quality of PSMs using functions that simply count a peptide's matched peaks. For example, these features look at the number of annotated peaks amongst the strongest 25 and 50 peaks in the experimental spectrum. Others report the proportion of explained intensity in the experimental spectrum, or count the number of b-,y-, y+2-ions, etc., that are observed in the given PSM.
When annotating fragment ions, we generally tolerate a mass differences of up to 0.5 Da between the expected mass of a fragment, as computed from the peptide sequence, and the actual mass observed in the spectrum. However, most of the true fragment peaks observed in spectra are much closer to their expected mass, usually being less than 0.1 Da away. A peptide that has many fragment peaks with a relatively large offset from their expected mass is likely to be relying on spurious opportunistic peak matches, and is therefore more likely to be incorrect. This type of peak offset information is most useful with the most abundant fragment ions, which are b,y, so offset related features focus only on them. Such feature will look at attributes like the average mass offset of all identified b- or y-ion peaks and the maximal mass offset for b- or y-peaks from their expected position.
Proteins are not random sequences of amino acids. They often contain conserved, or characteristic patterns that are responsible for inducing a specific spacial conformation or for providing certain function. In addition, certain amino acid patterns are more likely to be ionized and detected using MS/MS than others (e.g., basic amino acids are usually required for effective peptide ionization). These observations gave rise to the notion of proteotypic peptides,48-50 peptides that are most likely to be confidently identified by MS/MS methods. Many of the characteristics of proteotypic peptides can be captured using simple features that pertain to the peptide's amino acid composition (see supplemental information for more details).
We now turn to examine how our new scoring model can be used to improve the results of de novo sequencing and database searches. The supplemental material has additional experimental results, describing additional database search results with a standard protein mixture and experiments with peptide sequence tag generation. We first describe the process involved in training PSM scoring models for de novo sequencing.
In order to train our models, we used the partitioning of the training data into 13 sets according to charge and precursor mass, as described in Table 1, and trained a separate PSM scoring model for each partition. Each partition of the training data contained 13000-45000 spectra which were used to create about 400000 training samples as described below. We ran PepNovo on each of training spectrum Si, and retained the 2000 top-scoring sequences that were at least 6 amino acids long. If none of the 2000 de novo sequences was correct, we excluded the spectrum from the training set. Otherwise, we extracted the highest scoring correct peptide sequence and used it as the positive sample (the highest scoring correct sequence was usually also the longest correct sequence). From the remaining ~2000 incorrect de novo sequences we randomly sampled k = 10-30 sequences and used them as negative samples . Note that we did not sample the sequences uniformly from ranks 1-2000, rather we gave more weight to higher ranking sequences which typically are responsible for most of the ranking errors.
Using the feature functions described above, we created an instance in the feature space to represent the PSM of and Si, and similarly created instances to represent the PSMs of the incorrect de novo sequences. The intances and were used to create a set of k ordered pairs , which were added to the model's feedback function Φ (with all pairs in Φ having the same weight). We randomly selected 70% of the spectra's sets for training, while the remaining 30% served as a validation set which was used to determine when to terminate a model's training (to avoid overfitting).
Training each score model usually required less than 100 CPU hours, typically converging after less than 100000 training rounds. Figure 2 depicts the progression of the training of the model for doubly charged peptides with precursor masses 1100-1300 Da. The left side of the figure shows the training and validation errors (the error rate represents the proportion of the training PSM pairs in which an incorrect PSM scores higher than a correct one). Most of the ranking error is eliminated early on, falling from 45% to 10% within the first 50 rounds. The figure also shows that most of the error reduction is done using a small set of features; the graph on the right of the figure shows that approximately 25 features are required to achieve the aforementioned error reduction. Continuing to 10000 rounds lowers the validation error to 5.35%, which is only 0.22% higher than lowest validation error 5.13%, that is obtained after 90000 rounds. Therefore, by not seeking to fully optimize the model, we can save considerable time with the training. The graph on the left also shows that as the training progresses overfitting starts to become a problem (this is evident from the widening gap between the training error and validation error that does not decrease at the same pace). However, despite the overfitting, the overall validation error kept on decreasing, and we ended up choosing the model configuration that had the lowest validation error.
De novo sequencing of low-resolution MS/MS data is a difficult task. It is unreasonable to expect high accuracy rates from single de novo predictions, since often there are many similarly high-scoring candidates to choose from. Furthermore, some of the commonly used applications for de novo sequencing, such as database filtration with peptide sequence tags17,51-55 or homology-based database searches,56-58 actually perform better when supplied with multiple de novo predictions. It is in these circumstances that the advantage of ranking comes into play. Not only does our score increase the accuracy of the top predicted sequence, but it also significantly increases the chances of having a correct sequence in a small set of candidates.
We conducted several benchmark experiments to test the performance of our new scoring function in the context of de novo sequencing. We used several test datasets, including two test sets that were previously used in the literature:
Previous de novo sequencing benchmark experiments mostly focused on predicting a single sequence per spectrum.16,18,23,60 In these cases it made sense to look at the precision (ratio of correct amino acids in the predictions). However, when predicting multiple sequences, this notion is not well defined. Instead, we examine the proportion of test spectra for which one of the de novo predictions is completely correct, and also look at the rank in which a correct prediction first occurs.
Most publicly available de novo sequencing algorithms typically return a single sequence prediction. The newly developed MS-Dictionary algorithm61 takes a novel approach of combining de novo sequences and a database search. It uses dynamic programming and probabilistic scoring to generate large ranked lists (dictionaries) of possible peptides for query spectra. These peptides are then compared to a database via pattern matching. We examined MS-Dictionary's results with two settings, one that assumes that the peptides are tryptic (and thus lists only peptides that end with K or R), and the other that makes no assumption about which digestion enzyme was used. In addition, we ran experiments with the Peaks12 de novo sequencing algorithm (Peaks Online 2.0) which is one of the best commercial de novo sequencing algorithms available (in preliminary experiments it performed better than other programs such as NovoHMM18 and MSNovo23). The Peaks de novo algorithm only outputs 5 sequences per query spectrum.
Our experiments proceeded as follows. For each spectrum tested, we ran PepNovo and generated the top 2000 scoring sequences. We then reranked the sequences using the our novel scoring function. In the results described below we compare between the algorithm's performance with and without the reranking stage. On average the running time required per spectrum was 1-2 seconds, depending on the peptide's length. This running time usually divided equally between the de novo sequencing and the reranking. When the true peptide sequence was short (8-10 amino acids long), PepNovo typically predicted a complete sequence. However, with longer peptides, whose spectra are often incomplete and lacking detected fragment ion peaks for the amino acids near the terminals, PepNovo sometimes only predicted partial sequences (akin to long sequence tags).
Figure 3 shows benchmark results of the algorithms on the OPD280 and ISB769 datasets. In both datasets we see that PepNovo has significantly higher rates of correct predictions, especially when we look at a small set of de novo solutions. PepNovo uses a much more sophisticated scoring scheme than MS-Dictionary, which explains the large performance gap when small sets of predictions are concerned. In addition, MS-Dictionary is designed to predict only complete de novo sequences. Both the OPD280 and ISB769 datasets include some sequences longer than 14 amino acids (the length limit for which MS-Dictionary is deemed effective), which also explains the lower performance of this algorithm on these datasets. The Peaks algorithm displays accuracy rates that are slightly lower than PepNovo's without ranking. In Figure 3 we also see that reranking de novo sequences significantly increases the accuracy rates. There is an increase of 15-20% when considering small sets of predictions (1-10 sequences). The performance gap is still very significant for 50 and 100 predicted sequences, were the ranked PepNovo results practically reach the maximum they can attain (which is the value for the regular PepNovo results at 2000 sequences). With sets larger than 100 sequences, the gap naturally narrows until the two curves meet at 2000. These results show that the reranking is capable of taking correct, but poorly scoring sequences, and pushing them ahead in ranks. Often the sequences that get pushed forward are shorter than the top-scoring sequences returned by PepNovo. This phenomenon is especially common with spectra of peptides that have poor fragmentation. In such cases, the spectrum graph contains only a partial subpath that corresponds to the correct peptide, and this path frequently gets elongated with spurious edges. PepNovo ends up outputting these incorrect high-scoring sequences ahead of the lower-scoring correct ones. Our new ranking score detects many of these incidents and rectifies the ranks accordingly. This also explains why the average top reranked sequence tends to be shorter than the average top-ranked PepNovo sequence (see Table 2 for the average prediction lengths of the different algorithms).
Table 2 might suggest that PepNovo's superior performance could be attributed to its prediction of shorter incomplete sequences. To rule out this possibility, we benchmarked the algorithms on the task of predicting complete de novo sequences. In these experiments we discarded any prediction that did not span the entire mass range (this applies only to PepNovo and Peaks, since MS-Dictionary always predicts complete peptides). Note, that this puts PepNovo in a slight disadvantage compared to MS-Dictionary, since PepNovo's spectrum graph is not likely to contain a complete path for poorly fragmented peptides, while MS-Dictionary's search space includes all possible peptides.
We used several test sets taken form the HEK293 dataset32 in which all spectra belong to peptides with specific lengths: 8,10, and 12 amino acids. Figure 4 depicts the results of these experiments. For each peptide length, PepNovo's results are much more accurate than MS-Dictionary's (with as much as 30% more correct sequences for small sets of predicted sequences). Only when very large prediction sets are examined (200 sequences with length 8 and 500 sequences for length 10), does MS-Dictionary catch up with PepNovo's performance. These additional identifications made by MS-Dictionary belong to poorly fragmented peptides that do not have complete paths in PepNovo's spectrum graph. It is likely that PepNovo would be able to predict correct partial sequences in these cases. PepNovo's performance without ranking is at par with Peaks (Peaks has slightly better performance for length 8, while PepNovo has better performance with lengths 10 and 12). However, when PepNovo's results are reranked using our new scoring function, PepNovo exhibits a significant performance boost. The accuracy of the top predicted sequence rises by 10%-15%, and this gap is maintained even when we examine sets of 5, which is the maximal number of sequences generated by Peaks for each query spectrum.
Finally we note that PepNovo's superior performance at de novo sequencing can be harnessed to produce more accurate (and longer) peptide sequence tags for database filtration. PepNovo-generated tags can be 100 times more efficient than the ones used by InsPecT, which can lead to a 15-fold reduction in database search time (see results below). The supplemental material to this manuscript describes experiments we conducted that demonstrate PepNovo's improved tag generation capabilities.
We now turn to experiments that demonstrate how our ranking-based scoring function improves the performance of database searches. Training these models was done slightly differently than the training of the models for reranking de novo results. Instead of using incorrect de novo predictions for false PSMs, we used incorrect database search results (obtained from a run against a large set of shuffled protein sequences). This was done because the search space in a database search is much smaller than the space of all peptides, and thus generates “weaker” incorrect PSMs. In addition, we selected the training pairs of PSMs a bit differently, to make the ranking score become more classification oriented. Instead of having 100% of the pairs of PSMs come from the same spectrum (as was the case with de novo), we found that for database scoring, optimal results were obtained when only 20% of the pairs were selected this way. For the remaining 80% we used pairs of PSMs from different spectra (i.e., we added instances to the model's feedback function that ranked a correct PSM of spectrum S ahead of an incorrect PSM of spectrum S′). This way, we gave a higher weight to the classification-oriented goal of an ideal database scoring function, which aims to bring correct PSMs ahead of the incorrect PSMs from all other spectra, as opposed to the ranking-oriented goal of a de novo scoring function, that is just required to bring the correct PSM ahead of the incorrect PSMs from the same spectrum.
In the experiments described below we used the InsPecT database search tool17 in a variety of ways (see details below). This search engine is known perform much faster than standard commercial tools like SEQUEST (using a single CPU), and often produces a larger number of peptide identifications. The supplemental material to this manuscript contains experimental results that demonstrate the InsPecT identifies between 6% to 32% more peptides when searching ISB's standard protein mixtures.62 Our experimental results below demonstrate how our ranked-based scoring function can significantly improve upon InsPecT's search results.
To benchmark the performance of our new scoring method on standard MS/MS datasets, we chose an independent run from the HEK293 dataset,32 consisting of ~750000 spectra. We used InsPecT to perform the database search against the IPI human protein sequence database (version 3.42, containing ~30M amino acids). The searches were conducted in three different modes:
The final post-processing step of the database search, in all three running modes, was to filter the results to maintain a false discovery rate of 1% at the spectrum level, which corresponds to approximately 4% at the peptide level (i.e., 1% of the reported spectrum identifications and 4% of the reported peptide identifications are expected to be false positives).
The tags generated by PepNovo (used only in the “PepNovo Tags + Rank Score” mode) were a mixture of 3 tags of length 4, 35 tags of length 5 and 100 tags of length 6. This mixture of tags is approximately 100 times more efficient than the tags used by InsPecT's regular search that uses 25 tags of length 3 (see supplemental material). Since there are many fixed-time operations involved with the database search (file I/O, scanning the database, etc.), there is not a direct linear relationship between the tagging efficiency and the actual run-time speedup. Thus, PepNovo's 100 times more efficient tags led to an approximately 15-fold reduction in InsPecT's run-time.
Table 3 reports the results of these benchmark experiments. The table shows the number of spectra and peptides identified in each of the three search modes, and breaks down the results according to charge states. The total number of identified peptides is lower than the sum of the identifications made with charges 1-3 because often the same peptides were identified through several spectra with different charge states, and we only reported the number of unique peptide identifications. The maximal number of identifications was obtained using the method “PepNovo Tags + Rank Score” (18.8% more spectra and 22.9% more peptides than the regular InsPecT run). The largest increase in identifications was seen with charge 1, where the number of identified spectra rose by 90.5% and the number of peptides rose by 76.1% higher, compared to the number of identifications obtained with InsPecT. This indicates that InsPecT's scoring models do a poorer job with singly-charged peptides, compared to their handling of doubly-charged ones. The results for “InsPecT tags + Rank Score” also show a considerable improvement compared to the default InsPecT run, with an increase of 13.7% in the number of identified spectra and 14.7% in the number of identified peptides. When we consider these figures along with the improvement of +18.8% spectra and +22.9% peptides obtained with “PepNovo Tags + Rank Score”, we can conclude that almost 2/3 of the improvement of “PepNovo Tags + Rank Score” can be attributed to our improved scoring, while the rest of the gained identifications come from PepNovo's more accurate tags. Note that PepNovo's tags yield more identifications than InsPecT's tags despite the fact that they are 100 times more efficient. Interestingly, for triply-charged peptides, the results with “InsPecT Tags + Rank Score” are better than the results obtained with PepNovo's tags. This means that for triply-charged spectra InsPecT's tags are more accurate than PepNovo's. The reason for this is that triply-charged peptides typically have poor fragmentation, so in many cases it is quite difficult to extract long tags (4,5 or 6 amino acids long), while still relatively easy to get a good tag that is only 3 amino acid long.
Despite advances in genome sequencing and gene annotation algorithms, many genes remain unidentified even in the well-studied organisms.63,64 Annotation of genes using evidence of protein expression obtained via MS/MS experiments (“proteogenomic mapping”) is suggested as a complementary method to sequence-based gene prediction algorithms.32,35,65-73 Since proteogenomic studies involve searching mass spectra against all possible reading frames in a genome (a “six-frame translation”), the process can be quite time consuming when large eukaryotic genomes are investigated. In addition, the large search space encountered in proteogenomic studies leads also to lower sensitivity (fewer identifications) compared to searches against smaller protein databases.65,74
There have been several recent proteogenomic studies involving the six-frame translations of the human genome.67,69,71 However, these studies used relatively slow search programs such as SEQUEST and X!-Tandem,75 or relied on high-resolution FTMS to reduce the number of candidates that need to be considered.71 In our experiments, searching a single spectrum against a six-frame translation of the human genome required approximately 5 minutes of CPU time using InsPecT, which was benchmarked as being 10 times faster than X!-Tandem and 60 times faster than SEQUEST on a single processor.73 The recently developed MS-Dictionary algorithm61 is capable of performing rapid searches, on the order of a second per spectrum, against large sequence databases but the feasibility of this approach was only demonstrated on a small subset of MS/MS data (doubly-charged peptides 10-14 amino acids long).
Our novel ranking-based score helps ameliorate the two main deficiencies of proteogenomic mapping: speed and sensitivity. Using PepNovo-generated tags we are able to perform the database search significantly faster than the current state-of-the-art (approximately 15 times faster than InsPecT). In addition, reranking the results using our new scoring function significantly increases the number of identified peptides compared to the results obtained by a regular run of InsPecT.
We performed a benchmark experiment with the same HEK293 run used above to search the IPI sequence database. This time our sequence database was a six-frame translation of the human genome (NCBI build 35.3 masked using RepeatMasker), which contained approximately 3 billion amino acids - one hundred times larger than the IPI sequence database. The sequences in the six-frame translation were split into 40 files, and each sequence file was searched separately. The results of the 40 searches were then pooled and filtered, keeping the ten highest scoring peptide identifications for each spectrum. Similarly to the experiments with the IPI database, we ran three different types of searches: “Regular InsPecT”, “InsPecT Tags + Rank Score”, and “PepNovo Tags + Rank Score”.
Table 4 reports the results of these benchmark searches against a six-frame translation of the human genome. Like our experiments with the IPI database, we see a significant improvement when using PepNovo's tags and the new ranking-based score. However, the advantages of our new scoring become much more significant with the six-frame translation's challenging one hundredfold larger search space. There is a 61.3% increase in the number of identified peptides with “PepNovo Tags + Rank Score” search compared to a regular InsPecT run, and a 38.9% increase when only the reranking of results is applied. This increase is almost three times larger than the increase observed when we searched the IPI sequence database.
The total number of peptides identified in the six-frame translation is significantly lower than the number identified when searching IPI. However, while the regular InsPecT search losses 55% of its peptide identifications (it goes from 22518 peptides identifications down to 10356), the “PepNovo Tags + Rank Score” method fares significantly better, losing only 40% of its identifications (from 27685 peptides down to 16706). Similar reductions in the number of peptide identifications have been observed with previous proteogenomic experiments.74 One reason behind these reductions in identifications is the limited discriminatory power of the scoring functions. With a six-frame translation we encounter many more high-scoring incorrect PSMs compared to the number encountered when searching a significantly smaller search space. This reduces the number of positive identifications that can be accepted at a given false positive rate. For example, a 30% reduction was observed in the number of identified peptides that fell within exonic regions when switching from a protein sequence database to a six-frame translation of the genome of Arabidopsis thaliana.72 In addition, many of the peptides identified when searching protein sequence databases happen to fall on exonic boundaries. These products of splice events are not present in six-frame translations, and are bound to be missed. The number of such cases can be surprisingly large. In one case, 36.4% of the identifications made when searching MS/MS spectra against the human IPI sequence database belonged to peptides that spanned exonic boundaries.61
Table 5 compares between the peptide identifications made searching against IPI with the identifications made searching against the six-frame translation of the human genome. As expected, by comparing the second and third columns in the table, we see that the number of identifications made in the six-frame search is much smaller than the number of identifications made in the search against the one hundred times smaller IPI protein database. In fact, the fourth column shows that with all three search methods, a little more than a third of the peptides identified in the IPI search are lost because they do not appear in the six-frame translation (they span exonic boundaries). Many peptides that were identified in IPI and were also present in the six-frame translation, were not included in the final set of positive identifications form the six-frame search. The culprit in this case was the larger search space, which greatly increased the number of high-scoring incorrect PSMs. These additional high-scoring incorrect PSMs raise the scoring bar that needs to be passed for a PSM to be accepted as positive identification. Thus, these lost identifications can be directly attributed to deficiencies in the scoring functions. From this perspective, our novel ranking score performs significantly better than the default InsPecT scoring function. The fifth column shows that while a regular InsPecT run lost 5103 of the 22518 peptides it identified in IPI (22.7%), rescoring the InsPecT results using our novel score reduced this loss to 3037 from 25827 (11.8%). Interestingly, the search that used PepNovo tags lost only 2449 of the 27685 identified peptides (8.8%). The reason PepNovo's tags lose fewer peptide identifications is that these tags are 100 times more efficient than the tags used by InsPecT. Their higher filtration rate results in fewer high-scoring incorrect PSMs, which ultimately lowers the score threshold required to accept positive matches.
The last column in Table 5 lists the number of peptide identifications that are unique to the six-frame translation, representing products of unannotated genes. All three search methods show a similar ratio of these peptide identifications (between 5.7% and 6.9%). However since many more peptides got identified with “PepNovo Tags + Rank Score”, the number of novel peptides found with this method (1153), is significantly higher than the number found using a regular InsPecT search (625) or a rescored InsPecT run (828).
In this paper we explored how discriminative data-driven ranking models could be used successfully for the complex task of scoring peptide-spectrum matches (PSMs). We argue that this scoring problem is inherently a ranking problem, especially for de novo sequencing where the goal is simply to bring the correct PSM for spectrum ahead of all the incorrect ones (for the same spectrum). This led us to solve this problem with a machine learning ranking algorithm (RankBoost30), rather than use more common classification-oriented generative approaches.
When creating our scoring models, we had at our disposal a large set of diverse features which described many aspects that are known to be indicative of a poor or strong PSM. These features examined various attributes like the peptide's path in the spectrum graph, peak annotations (e.g., the numbers of b,y-ions that got annotated), the peptide's sequence (characteristics of proteotypic peptides), and more. An important source of features were based on predictions of peak ranks, that were done using a novel ranking-based algorithm we developed.31 Each of these features, in its own right, might be only marginally successful at discriminating between a correct and incorrect PSM; constituting what is called a “weak learner”. The RankBoost algorithm30 proved to be a very effective at combining this diverse set of features into powerful and discriminatory scoring function. In addition, by examining the models created by RankBoost we were able to gain insight into the dynamics and contributions of the various features (see supplemental material).
We demonstrated how our novel scoring function can be used to deliver superior performance in several MS/MS scoring tasks. By reranking the original order of the de novo results we were able significantly improved PepNovo's accuracy rates. This boosted the algorithm's performance well above the current state-of-the-art. For instance, when making a single sequence prediction, our reranked results are 10%-20% higher than the current high-performance algorithms (PepNovo and the commercial software Peaks). This performance gap persists even for larger sets of predictions (see Figure 3 and Figure 4).
Our novel score also greatly enhanced the accuracy of PepNovo's peptide sequence tags (see supplemental material). This enabled us to generate longer tags without compromising their ability to be a covering set (i.e., a set which contains at least one completely correct tag). The enhanced tagging capability both increased the accuracy of database searches and significantly reduced the running time, enabling us to speedup InsPecT's searches against a six-frame translation of the human genome by a factor of 15.
In addition to developing scoring models for de Novo sequencing, we trained specific models for rescoring database search results. When used to rescore results of InsPecT runs that searched MS/MS spectra against the human IPI protein sequences (~ 30 million amino acids), our new score was able to increase the number of peptide identifications by 14.7%. The increase grew to 22.9% when the search used PepNovo's tags instead of the default ones generated by InsPecT. However, the benefits of our novel scoring method were more substantial when applied to results of a search against a six-frame translation of the human genome (~ 3 billion amino acids). When we rescored InsPecT's results we witnessed a 38.9% increase in the number of identified peptides; using PepNovo tags along with the ranking-based scoring led to a higher increase of 61.3%. With our novel scoring method we also almost doubled the number of novel peptide identifications belonging to unannotated genes (increasing the 625 new peptides found in a regular InsPecT search to 1153).
Our experimental results underscore the fact that our models perform particularly well in challenging domains that have large search spaces. This trait becomes especially important when we start to consider more and more complex analysis tasks, such as searches that consider alternative splicing,32 large-scale “blind” searches,33,76 and even searches for fusion proteins.77 The search spaces in these domains can be so large, and contain so many high-scoring incorrect PSMs, that without more powerful scoring functions, it will be impossible to accept but a handful of the highest-scoring, and most obvious, identifications. Many interesting identification will be lost since they will lack statistical significance with current scoring methods. Our scoring function, which can be used as a stand-alone post-processing operation, can help increase the number of interesting discoveries made in such experiments.
We present novel approach to scoring of peptide-spectrum matches. We use discriminative boosting-based algorithms that are able to harnesses the large volume of MS/MS data presently available to train more accurate scoring models. Our new method improves the performance of our de novo sequencing algorithm beyond the current state-of-the-art, and also greatly enhances the performance of database search programs.
I would like to thank Natalie Castellana and Samuel Payne for helpful discussions and their assistance in running benchmark experiments. I am grateful to Pavel Pevzner form many inspiring discussions on the various topics covered by this manuscript. This research was supported by the “National Center for Research Resources” of the NIH via grant P-41-RR24851. I would also like to acknowledge the UCSD FWGrid Project for the availability of their computational infrastructure. The UCSD FWGrid Project is funded in part by NSF Research Infrastructure Grant number NSF EIA-0303622.