|Home | About | Journals | Submit | Contact Us | Français|
Accurate modeling of peptide fragmentation is necessary for the development of robust scoring functions for peptide-spectrum matches, which are the cornerstone of MS/MS-based identification algorithms. Unfortunately, peptide fragmentation is a complex process that can involve several competing chemical pathways, which makes it difficult to develop generative probabilistic models that describe it accurately. However, the vast amounts of MS/MS data being generated now make it possible to use data-driven machine learning methods to develop discriminative ranking-based models that predict the intensity ranks of a peptide's fragment ions. We use simple sequence-based features that get combined by a boosting algorithm in to models that make peak rank predictions with high accuracy. In an accompanying manuscript, we demonstrate how these prediction models are used to significantly improve the performance of peptide identification algorithms. The models can also be useful in the design of optimal MRM transitions, in cases where there is insufficient experimental data to guide the peak selection process. The prediction algorithm can also be run independently through PepNovo+, which is available for download from http://bix.ucsd.edu/Software/PepNovo.html.
Tandem mass spectrometry (MS/MS) has emerged as the leading method for high-throughput protein identification and quantitation in proteomics.1,2 Recent technological breakthroughs have led to a dramatic increase in the volume of MS/MS data being generated. While analyzing this growing amount of data encounters computational bottlenecks – which need to be met with ever-improving algorithms – this surplus of the data also opens a window of opportunity: it allows us to train more sophisticated and powerful models to be used in the analysis.
Many traditional statistical analysis methods rely on creating probabilistic models that describe the process being studied. However, except for very elementary cases, such models tend to over-simplify the phenomenon they describe and are consequently inaccurate. The tremendous growth witnessed in recent years, both in the volume of experimental data that is collected and in the computational resources made available for its processing, have given rise to new machine learning-based analysis approaches. These new computational methods do not require us to explicitly understand the underlying mechanism behind the processes being modeled. Rather, they rely on finding patterns and correlations concealed in the data and then utilize them to draw insightful conclusions (this computational idea is at the heart of novel methods like metagenomics3).
In this paper we explore how a machine learning boosting algorithm4 (in the context of ranking5) can be used to leverage large amounts of experimental MS/MS data to create models for predicting the intensity ranks of a peptide's fragment ion peaks based only on the peptide's amino acid sequence. Using this data-driven machine learning method, we are able to create powerful predictive models without having to fully understand the complex dynamics of peptide fragmentation. Our models are constructed from simple (“weak”) sequence-based features that get combined by the boosting algorithm into a strong accurate prediction models. In addition, examining the values of the parameters in the trained models can gain insights on the dynamics of fragmentation pathways.
What makes our method effective, and is also an underlying assumption in all other machine learning-based MS/MS sequencing algorithms, is the fact that MS/MS spectra are reproducible. This makes them suitable targets for prediction-based algorithms. The reproducibility of mass spectra stems from the fact that they are generated by recording the stochastic fragmentation process of many molecules. Thus, repeated experiments with the same peptides produce similar looking spectra.
The reproducible nature of mass spectra makes it possible to use simple but powerful comparison-based methods like spectral libraries for peptide identification.6,7 While this method is accurate, it is not suitable for identifying many less common peptides (post-translationally modified peptides, products of miscleavages, peptides from rarely expressed proteins or alternative splice variants, etc.) To be able to identify all spectra using a similarity based method, one would have to be able to predict a theoretical mass spectrum for any peptide sequence. The predicted theoretical spectrum must be sufficiently similar to the observed experimental spectrum in order for the similarity based identification to succeed. However, due to the complex nature of peptide fragmentation is often difficult to make such predictions accurately. The popular Sequest algorithm,8 which relies on the cross-correlation between the theoretical spectrum of a database peptide and the experimental mass spectrum, skirts the issues involved in predicting theoretical spectra by relying on a naïve fragmentation model that assigns a fixed equal intensity to the major ions (e.g., b- and y-ions) and a lower fixed intensity to their neutral losses. In practice, most experimental spectra are not very similar to their Sequest-predicted theoretical spectra, which leads to missed identifications.
It is widely accepted that better understanding of peptide fragmentation is required to produce more accurate and sensitive peptide sequencing algorithms.9–13 To this end, several data-mining studies have been performed on large sets of mass spectra with the goal of discovering and quantifying the factors that play a role in peptide fragmentation. Some of the notable influences on fragmentation that were examined include: the influence the amino acids adjacent to the cleaved bond have on its observed intensity;10,11,13–15 the effects ceratin amino acids have on the abundance of neutral losses;10,12,16 and how the composition of basic amino acids in the peptide influences the fragmentation.11,15,17 In some cases this information was incorporated into the scoring models used by sequencing algorithms.13,18–25
There have also been additional efforts to incorporate predictions of theoretical spectra into the peptide identification process. A recent method developed by Zhang26,27 for creating realistic theoretical mass spectra relies on a kinetic model that simulates the peptide fragmentation process in order to predict theoretical MS/MS spectra from peptide sequences, which can then be compared with experimental mass spectra for identification purposes. This method was used as the basis for a new scoring function for a database search program that outperformed Sequest and Mascot.28 However, the spectrum prediction process can be difficult at times, and becomes less accurate and more computationally intensive with large peptides and higher charge states. The ByOnic database search algorithm23 takes a different approach, using heuristic-based rules for predicting fragment ion ranks, which it incorporated into its scoring function.
Due to the complexity of peak intensity prediction problem we took on a slightly relaxed problem, and concentrated only on predicting a peptide's relative fragment-ion peak ranks. We call this problem the Peak Rank Prediction Problem and it is outlined in Figure 1. In this problem, we are given a peptide sequence P = p1 ... pn and a set of fragment ions , and we need to predict π, which is a permutation of P's fragment ions. The goal is for the permutation π to resemble, as much as possible, the ordering of peaks, from strongest to weakest, as observed in an experimental mass spectrum of the peptide P. Since the weakest peaks’ intensities are usually below the instrument's detection level, not all n peaks for a fragment type are likely to be observed. We are therefore mainly concerned with accurately predicting the order of the strongest peaks.
We note that while peak intensity predictions can possess more discriminatory power than peak rank predictions, it is often beneficial for sequencing algorithms to rely on peak ranks instead of intensities. In particular, using peak ranks makes an sequencing algorithm less vulnerable to the idiosyncrasies often observed in individual MS/MS spectra.29 In an accompanying manuscript,30 we describe how our peak rank prediction models are incorporated into a scoring function for peptide-spectrum matches that greatly improves the performance of both de novo sequencing and database search algorithms.
Peak rank prediction models can also be useful for tasks out side the realm of peptide sequencing. For example, being able to predict which peptide fragments are likely to be the strongest can be invaluable for selecting optimal transitions in multiple reaction monitoring (MRM) experiments in which one is trying to quantify proteins for which there is insufficient previous MS/MS experimental data to guide to peak selection.31
Peptide fragmentation is a complex process, often involving several competing chemical pathways, including some that have yet to be discovered or thoroughly understood. Much effort has been devoted towards studying the processes that govern peptide fragmentation .32–36 Below we describe some of the prominent mechanisms involved in peptide fragmentation that are described in the literature. Since the data used to train our models was acquired in experiments involving collision induced dissociation (CID), we focus on factors that are known to influence this type of fragmentation.
The most widely accepted model for peptide fragmentation is the mobile proton model.9,37–42 According to this model, during the ionization stage, peptides are protonated at various sites (mainly terminal amino groups and basic side chain groups), with some protonation sites being more favored than others (e.g., the amino group of arginine). The peptide cleavage is initiated by migration of the sequestered charge (proton) from the initial site of protonation to an amide carbonyl oxygen along the peptide backbone. A nearby carbonyl, N-terminal to the protonated carbonyl oxygen, serves as a nucleophile to attack the electropositive carbon of the protonated carbonyl.9,43,44 This cleavage usually produces either a prefix (N-terminal) b-ion or a suffix (C-terminal) y-ion. This type of fragmentation is considered a “charge-directed” process, since it requires the involvement of a proton near the cleavage site.9
The energy required to mobilize a proton from a basic side chain or the terminal amino group depends on the identity of the protonated amino acid, with dissociation energy requirements mimicking the order of amino acid gas phase basicity,36 i.e., the energy required to mobilize a proton that is on arginine > lysine > histidine > non-basic amino acids. In some cases, such as peptides containing arginines, the energy required to mobilize the proton might be too high, making it possible for alternative fragmentation pathways to dominate; pathways that do not involve the ionizing proton (a “charge-remote” fragmentation pathway). These fragmentation events often get initiated by ions containing long side-chains or poly-ring structures.45 The dissociation of protonated peptides can be described as a competition between charge-remote and charge-directed fragmentation pathways, in a complicated reaction pattern where fragment ions are formed with substantially different probabilities.36
The activity of charge-direct and charge-remote pathways can be significantly enhanced or reduced depending on the identity of the amino acids adjacent to the peptide's cleavage site. For instance, the fragmentation of the bond N-terminal to proline is known to be very active when there is a mobile proton.14,46 Often this is the most active site in the peptide, and it is responsible for a very intense peak in the observed spectrum. Similar enhanced fragmentation, though slightly less intense, occurs if we replace the proline with a glycine residue. In contrast, when no mobile protons are available, the charge-remote pathways can dominate and cleavage is enhanced near acidic residues.9,47 In addition to fragmentation of the backbone, there are also pathways that lead to the peptides losing neutrally charged molecules such CO ,NH3 and H2O ,32 which diversifies the set of observed fragment ions.
The training data for our experiments consisted of approximately 320000 unique peptide-spectrums pairs collected from various MS/MS experiments involving low-resolution CID ion-trap mass spectrometers. Most of the data we used was generated in the Briggs lab at UCSD (samples of from human HEK293 cell culture48 and samples from Dictyostelium discoideum49), and the Smith lab at PNNL (samples taken from Shewanella oneidensis MR-150,51).
We used the InsPecT database search tool21 to perform peptide identifications (release 20070613), using the default search parameters (precursor mass tolerance 2.5 Da, fragment ion tolerance 0.5 Da). All searches were performed using a shuffled decoy database.52,53 The InsPecT F-score threshold values for accepting identifications were selected to ensure a true positive peptide identification rate of 98% (i.e., only 2% of the peptide hits came from the decoy database).
Since a peptide's charge and precursor mass can greatly influence the nature of its observed spectrum, we partitioned the training data into distinct sets as described in Table 1, and trained separate models for each partition. The number of partitions generated for each charge depended on the number of training peptides available (we divided the doubly-charged peptides into 6 parts, while the singly-charge were only divided into 3 parts). Partitioning the data this way also gave computational benefits, since many of the partitioned models required several days to train and also used up most of the available RAM on the machines.
A peptide's mass and charge are not the only factors that influence its mass spectrum. The peptide's amino acid composition also has a great influence on the fragmentation. In particular, the number and location of the basic amino acids (namely arginine and lysine) play an important role in influencing which peptide fragmentation mechanisms are most active. To reflect the important role amino acid composition plays in peptide fragmentation, we applied an additional partitioning of the training data according to proton mobility. Following the division suggested by Kapp et al.,15 we placed peptides in three categories, according to their amino acid composition:
To create models for predicting peptide fragment we used a discriminative ranking algorithm called RankBoost due to Freund et al.5 The RankBoost algorithm uses a machine learning method called boosting,54,55 which produces highly accurate prediction rules by combining many “weak” rules that, each on their own, might be only moderately accurate.
To train models with RankBoost we need to supply the algorithm with several inputs:
The goal of the RankBoost learning algorithm is to create a scoring function that weights and combines feature functions from F, in order to induce a ranking that misorders as few pairs as possible in Φ. For the problem at hand, this means that the ranking function H is optimized to give high scores to strong peaks and low scores to weak peaks (when the peaks being compared belong to the same peptide).
The structure of the peak rank prediction problem lends itself naturally to a ranking algorithm-based solution, and in this section we demonstrate how this can be done using the RankBoost algorithm. First, we note that the intensities observed in a peptide's experimental spectrum naturally induce a ranking of the fragment ions (this is done simply by ordering the the peptide's fragment ions according to the intensities observed for them in the spectrum). We can also easily extract from a peptide's mass spectrum a set of ordered pairs of peaks which are required for the algorithm's feedback function Φ. However, the key components in our approach, as in any machine learning-based method, are the features we use to describe the data.
Our feature functions derive all their needed information from the peptide's sequence P = p1 ... pn. Though the peptide's precursor charge also influences the observed peak pattern, as we noted above, we train separate models based on charge, size, and proton mobility state, so this information does not need to be conveyed to the feature functions. The feature functions are used to create a feature vector to describe each peptide fragment. For each peptide cleavage position 1 ≤ i < n and fragment type , we fill a separate feature vector. Ultimately, these vectors are scored by the model to assign each fragment with a ranking score. The rank scores are then used to induce a permutation of the peptide's fragment ion peaks (e.g., if b4's rank score is higher than y7's, this is interpreted as the fact that we expect b4's intensity to be higher than y7's in the experimental spectrum).
There are several different types of sequence-based features that are predictive of a fragment's intensity, which we describe below. The total number of features that can be used to predict a particular fragment's rank is over 200. In each model we only consider the four most abundant fragment types (according to the statistics observed in the training examples for the model's specific charge/size/mobility state). This set of fragments, denoted by F , usually contains the abundant prefix b- and suffix y-ions, and their neutral losses (e.g., b − H2O). Thus, our models can potentially use over 800 features. However, in practice, not all features get included in the models since during the training, many of them do not yield a significant improvement to the ranking performance compared to the features already selected to be in the model (features are only added to a model if they reduce the training ranking error5).
Even though a model might contain hundreds of features, the actual feature vectors that are created for peaks are extremely sparse. On average, most of the features are not relevant to the given peak instance (in this case we say the feature abstains). Typically less than 10−20 feature functions do not abstain (for example see Figure 5). Most of the features are only applicable to a specific fragments type, so for example, only the features belonging to y-ions participate in the scoring of a y-peak, the rest of the features abstain. In addition, many of the features, especially the amino acids adjacency features described below, come in sets of 20 (one feature for each standard amino acid), and for every fragment scored, only one of the features in the set is assigned a nonzero value.
The first feature we mention is an indicator function which holds the type of fragment being examined (e.g., b,y,b − H2O, etc.) This feature is used to give a prior score for the different fragment types (usually the b- and y-ions receive a positive score while the neutral losses get zero). This feature assigns a value with all fragment ions. The rest of the features are designed to assign values to specific fragment types (this is denoted by the superscript t in the feature names). Following is a description of the three classes of features we use: peak location features, adjacent amino acid features, and peptide composition features.
The relative mass of a peak has a significant influence on the peak's observed intensity. Peaks located near the center of the peptide generally have higher intensity than the peaks near the N- and C-terminals.10,56 This could be because the cleavage of the amide bonds might be more energetically favorable near the center of the peptide, than it is near the terminals. Another possible reason for this could be that peaks near the terminals are often close to the edges of the“visible” range of masses which the instruments are calibrated to detect, and can thus suffer from instrumental bias that records a weaker signal for them. Note that the mass spectrometers that generated our training data were calibrated to detect peaks between a minimal mass that is approximately ~ 0.25· precursor mass, and a fixed maximal mass of 2000 Da. For example, if the peptide has a precursor mass of 2500 Da, only b and y peaks that have a mass of 625−2000 Da can be detected. Since other instruments can have significantly different visible mass ranges, this would require training specific models for such new instruments even if the same method for peptide fragmentation method is used.
Let Min be the minimal peak mass that can be detected with a peptide with P's precursor mass; Max be the maximal mass that can be detected (the minimum between P's precursor mass and 2000); and m be theoretical mass of the peak being ranked. There are 3 features (per fragment ion type ) used to describe the location of a peak:
Note that for each instance, only one of the features and gets assigned a value depending on wether the peak is closer to the N-terminal or the C-terminal. The superscript t in the feature names reflects the fact that there is a separate feature created for each fragment ion type .
The identity of the amino acids adjacent to a cleavage site play an important role in determining the propensity of the peptide to undergo fragmentation at that site.10,11,14,15 These influences can be very significant. For instance, the peaks that are products of a cleavage N-terminal to proline are extremely strong14,46 (usually they are much stronger than all other peaks in the spectrum).
We derived a simple set of pattern-based features to describe the amino acids adjacent to a cleavage site. We only look at the amino acids that are up to 3 positions away from the cleavage site, since amino acids farther away do not have as strong an influence on the fragmentation. In total, we created 6 sets of 20 features each, which we describe below. Let X denote one of the 20 standard amino acids; P = p1 ... pn be the peptide sequence for which we want to predict peak ranks; i be the index of the cleavage site we are examining (i.e., i is the number of amino acids that are on the N-terminal side of the cleavage position); and be the fragment type being scored. The indicator function I[[pi = X]] returns 1 if the amino acid pi is X and 0 otherwise. Our 6 sets of features can be expressed using the following indictor functions:
These adjacent amino acid features are sparse. For any given peptide and cleavage index i, at most 6 of the 120 features takes a value 1 (less than 6 features might take a nonzero value if the index i being examined is near the terminals).
Besides examining the amino acids adjacent to cleavage site, we can also extract important information from a peptide's general amino acid composition. For instance, if a peptide has basic amino acids near the N-terminal, rather than the C-terminal, the observed b-ion fragments can be stronger than the y-ion fragments, contrary to what is typically observed with tryptic peptides. Furthermore, there are amino acids that are known to be more prone to neutral losses, such as asparagine which often loses NH3. A simple way to capture this sequence information is to count the number of occurrences of each amino acid on both sides of the cleavage site. We created the following two sets of 20 features for each fragment type :
Note that we exclude the amino acids that are 1−3 amino acids away from the cleavage site, since these are covered by the adjacent amino acid features.
We also take note of the amino acids on the N- and C-terminals, since these specific positions can have a special influence on the fragmentation outcome. We use two sets of features to express this information:
In order to train a model with RankBoost, we need to supply the algorithm with samples from the instance space and a feedback function . For each partition of the training data (see Table 1), we separated the peptides into three sets based on their proton mobility (mobile, partially mobile, and nonmobile), and repeated the following steps:
When we examine repeated MS/MS experiments involving the same peptide, we often observe an element of randomness in the peak intensities. Pairs of low-intensity peaks often have different ranks in different experimental spectra. Therefore, with low intensity peaks, the order observed is often determined not so much due to mechanisms that govern peptide fragmentation, but more due to the randomness of the peak intensity measurements. Applying the weighting scheme in Eq. 1 reduces the influence of these somewhat randomly ordered instance pairs on the models being trained.
During the training we assess the model's performance, by measuring its rank loss, which equals the weighted proportion of pairs of instances in Φ whose order is predicted incorrectly by the model. Measuring performance on the same set of data used to train the model is not recommended since this usually leads to model overfitting. To avoid this pitfall, we set aside a portion of the data to serve as a validation set which is used to test the performance of the model (and determine what configuration of the parameters is optimal in general). We typically set aside 1/4 of the training data for validation purposes.
A typical training set we used to train the models consisted of ~ 10000 peptides, which created ~ 400000 instances (peaks) in , and ~ 4 · 106 pairs of the form in Φ. Our algorithm is capable of performing approximately 2 rounds of boosting per second on this data using a single CPU. The training typically required between 100000 to 400000 rounds to complete. We now examine in detail the training of a specific model for predicting fragments in doubly-charged “mobile” peptides of mass 1100−1300 Da, in order to observe the dynamics of the convergence of the RankBoost model.
Figure 2 depicts the statistics obtained from the training of the ranking model. The greatest reduction in error rates is achieved in the initial boosting rounds. The first feature to be selected in the models is (Relative peak location of y). Using this feature reduced the initial error from about 50% to 33%. By round 10 there are only 4 features in the model, and they alone reduce the ranking error to 26.6%. After 100 rounds, the model includes 17 features, the training error was reduced to 15.1%. From this stage on, adding more features and performing additional rounds of boosting yields a small but steady improvement in the model's performance. After 100000 rounds, using 423 features, the training error was 7.97%. At the stage of the model's convergence at round 200000, the validation error is minimal (7.89%), and the model includes 512 features. Continuing with additional rounds does not improve the validation error, but does start to widen the gap between the validation and training error (to a maximum of 0.08%). This gap, though small, is evidence that the model is starting to experience overfitting (it is being optimized specifically for the training data).
Figure 3 illustrates the dynamics of tuning the scores assigned to a ranking feature. The figure shows a graphical representation of the cumulative contribution of the feature (relative peak location of y), after various numbers of update rounds (in every boosting round only the most beneficial feature has its weights changed). The feature , can have have a value between 0, designating that the peak is located at the minimal detectable mass, and 1, for a peak at the maximal detectable mass. After the first round the rule that was most beneficial was to assign a score +0.5 if the y peak falls between 5%-80% of the mass range. Subsequent update rounds refine the function further, decreasing the score near the extremities and increasing it near the center. This feature's behavior corresponds to a known phenomenon observed with MS/MS spectra of peptides, in which the peaks near the center are, on average, much stronger than the peaks detected in the low and high mass ranges.10,56 These figures also illustrate the typical behavior of the learning algorithm, which makes large changes in the initial rounds, but only smaller refining changes thereafter.
In Figure 4 we compare the feature with peak location features for the other fragment types used in the model: b, b − NH3 and b − H2O. All four features show a similar trend, the scores near the terminals are much lower than the scores near the center of the detected mass range. However, the magnitude of the b and y feature scores is much greater than the magnitude observed for b − NH3 and b − H2O. This reflects the fact that b- and y-ions are typically much stronger (and are ranked higher) than ions with neutral losses.
Table 2 lists the most positive and most negative adjacent amino acid features in the model for the y,b,b − NH3, and b − H2O ions, trained on doubly-charged mobile peptides of mass 1100−1300 Da. In the interest of clearer presentation the names of the features were shortened, omitting (since all features in the same column belong to the same fragment type). Note that with the features listed in the table, as with all our indicator function-based features, we assume that if the indicator returns a value 0, the feature's score is also 0. In addition, since with low-energy CID fragmentation it is impossible to distinguish between the isobaric amino acids leucine and isoleucine, these two amino acids are represented by a single feature (I/L) in all the tables below.
The features listed in Table 2 capture much of the previously observed phenomena regarding the influence of amino acids on peptide fragmentation.10,11,15 With mobile peptides, the most active fragmentation pathways tend to be charge-directed, and most prominent among these is the cleavage N-terminal to proline. This is reflected in the tables with the highest scoring feature for all fragments being “Cut +1 = P”. The influence of proline goes beyond the position directly N-terminal to the cleavage site, since “Cut +2 = P” and “Cut +3 = P” also have high scores. When the proline is on the other site of the cleavage the resulting peaks tend to be extremely weak, and often they do not get detected. Consequently, the features “Cut −1 = P” tend to have negative scores. Glycine (G) is known to have a similar effect to proline (though often weaker).
The tables also show other known effects, such as the positive influence of the aliphatic amino acids (e.g., A,V,I,L) when they appear N-terminal to the cleavage. There are also known negative effects such as when the amino acids serine (S) or threonine (T) appear N-terminal to the cleavage, or when histidine (H) is C-terminal to the cleavage.15,57 The table also shows amino acid influences that are particular to specific fragment ions. For instance serine and threonine that are known to promote water loss,16,32 have positive scores in b − H2O, but not with the other fragment types. Similarly, glutamine (Q) and asparagine (N) are known to increase loss of ammonia which explains their features’ positive scores with b − NH3. Another interesting phenomenon is the negative influence that the acidic amino acids (D,E) have on the intensity of neutral losses, especially b − NH3, which seems to be more prominent than the effect they have on the b- or y-ions. The tables also show the negative effect of having arginine on the C-terminal side of the cleavage (“Cut +1 = R”) with the b,b − NH3 and b − H2O fragments. This can be explained by the fact with doubly-charged mobile peptides, if there is an arginine on the C-terminal side, that means that there is no additional one on the N-terminal side. This sequesters most of the charge on the suffix fragments, which makes the prefix b fragments much weaker. In addition, the fact that this feature is most negative at the +1 position means that the arginine itself also interferes with the fragmentation (this fact was also previously observed15,57).
Table 3 compares the amino acid adjacency features for b-ions across different mobility states (mobile, partially mobile, and nonmobile). The main difference between these models is the role that the charge-remote pathways play (in particular the fragmentation that is initiated by having aspartic acid (D) on the N-terminal side of the cleavage site). With the mobile peptides, the most positive feature is “Cut +1 = P”, while “Cut −1 = D” is not even among the top ten. This starts to change with the partially mobile peptides, where both features have similar strong scores. However, with the nonmobile peptides, the charge-remote pathways become more dominant, which is manifested in the table by a much higher score for “Cut −1 = D”. Another difference, between mobility states is the positive effect of having histidine N-terminal the cleavage site (e.g., “Cut −1 = H”). Note that these feature do not appear in the mobile model because doubly-charged mobile peptides with a tryptic end cannot contain additional basic amino acids like histidine (see definitions of proton mobility states). When examining the most negative features, we see that the lists for the different mobility states are quite similar.
Table 4 examines the most positive and most negative peptide composition features in the model for doubly-charged mobile tryptic peptides with mass 1100−1300 Da. The strongest positive features involve the counts of basic amino acids on both sides of the cleavage site. For the y-ion, the highest scores are given to the presence of basic amino acids on the C-terminal side of the cleavage, while for the b-ion the presence of basic amino acids on the N-terminal side is rewarded. Interestingly, having histidine on the C-terminal is also highly rewarded with the y-ion (the score for having at least one histidine is 0.78). This might be because, in general, having histidine in the vicinity of a cleavage cite increases the fragmentation.57 The presence of aliphatic amino acids (such as leucine and alanine) is also rewarded with high scores. For b − NH3 we note that as expected, having asparagine (N) on the N-terminal side of the cleavage is highly rewarded (asparagine has a tendency to lose NH3). In addition, the positive score for having multiple phenylalanines (F) can be explained by the known fragmentation mechanisms that involve loss of NH3 from aromatic amino acids.58,59 With b − H2O we see positive effect of having histidine (H) in the peptide. We also note the positive scores associated with having threonine (T) and serine (S) on the N-terminal side of the cleavage (the presence of these amino acids is known to increase water loss16,32). As far a the negative features are concerned, it is worthy to note that all feature types are penalized for having proline on the N-terminal side of the cleavage. This reflects the far reaching effects of proline (as seen in Table 3).
Table 5 lists the most positive and most negative N- and C-terminal amino acid features. The strongest features for the y-ion add premiums when the C-terminal is basic (“C-term aa is R” and “C-term aa is K”). There is also a high premium when the N-terminal is proline. This is most likely given to counter the negative scores assigned for the amino acid adjacency features like “Cut −1 = P”, since when a proline is near the terminals its positive and negative effects on fragmentation are diminished. Interestingly, the highest positive terminal feature scores belong to “N-term aa is Q” and “N-term aa is W” for b − NH3 and “N-term is E” for b − H2O. These feature scores are a manifestation of the glutamine/glutamic acid effect,60 in which extensive loss of NH3 occurs when the N-terminal amino acid is glutamine, and a loss of H2O when the amino acid is glutamic acid. Our results also indicate that there is substantial loss of NH3 when tryptophan is on the N−terminal (though involving a different fragmentation mechanism61). The most outstanding negative scoring features are “N-term is Q” and “N-term is E” with the b- and y-ions. This is also due to the glutamic acid/glutamine effects. When the N-terminal amino acid is either glutamine or glutamic acid, the scores of b- and y-ions get reduced in order to increase the ranks of the b − NH3 and b − H2O, respectively.
We trained a total of 39 models for peak rank predictions (3 mobility states × 13 partitions described in Table 1). The experiments below test the accuracy of these models for the peak rank prediction task.
First we examine how peak ranks get predicted in practice. Figure 5 gives a detailed example of the calculation of the peak rank scores of the top three peaks in the doubly-charged peptide GEEVTPISAIR. Part a displays the experimental mass spectrum, in which the top three peaks are y6 (intensity 89.9), y7 (intensity 67.5), and b5 (intensity 45.9). Part b of the figure holds a table that shows how the peak rank scores are calculated (only features that have a nonzero score are listed). The calculated rank scores induce the same rank ordering that is observed in the experimental spectrum (y6 has a score of 4.67, y5 has a score of 4.22, and b5 has a score of 3.22). These scores were computed using the model for mobile doubly-charged peptides of mass 1100−1300 (this model's most prominent features are listed in Table 2 through Table 5). In this model, both the y- and b-ion fragment types receive a score of +0.43 from the “Fragment Indicator” feature (b − NH3 and b − H2O, which are the other two fragments in the model, receive a score of 0). The table shows that the peak location features give a large score premium to all three peaks since they are b- and y-ions situated near the center of the peptides. Generally, the adjacent amino acid features have a lesser weight than the position features, with exception of the features “Cut −1 = P” which gives a high score premium of +1.52 for y6 and +1.37 to b5 (note that y6 and b5 get different scores since they are different fragment ions, and are assigned different features: and , respectively). The peptide composition features generally contribute low scores, with exception of the “# C-side R=1” which gives +0.36 to the y-ions. The major factor that contributes to raising the y-ions scores’ compared to b5's is the feature “C-term aa is R” which gives +0.5 to the y-ions and −0.5 to the b-ion. This feature can be explained by the fact that with a mobile peptide, if the C-terminal is R, this means that most of the charge is sequestered at the C-terminal which leads to strong y-ions and weaker b-ions.
In order to examine how well the peak rank predictions fit experimental spectra, we selected a test case of peptides of length 10 and ran benchmark experiments with them. Figure 6 shows histograms of the ranks predicted for the peaks observed in the experimental spectra with ranks 1,3,5, and 10. The figures show that the rank predictions for the stronger peaks are much more accurate than the rank predictions for weaker peaks. For example, in over 60% of the cases the peak predicted to be strongest was in fact the strongest peak in the spectrum. However, when we examine the tenth strongest peak in the spectrum, we see that only in 9% of the cases, the predicted rank is correct. In addition, with the stronger peaks (ranks 1 and 3), the predicted ranks tend to be more concentrated around the observed rank, while with the weaker peaks (ranks 5 and 10), the predictions are more spread out.
We ran additional benchmark experiments to test the peak rank prediction models on peptides of various lengths and charges. Table 6 holds results of these experiments. In this analysis we examined only the top 7 peaks, since these are the ones that are most likely to be observed in experimental spectra (in addition, the predictions with lower ranked peaks are less accurate, and thus less helpful when matching to experimental spectra.) To evaluate how well our predictions match the observed data, we examine two parameters. First, we computed Spearman's rank correlation coefficient (ρ)62 between a peptide's predicted peak ranks and the ones observed in the experimental spectra:
where di is the difference between i (the predicted rank) and the actual rank observed for the ith fragment in the experimental spectrum, and n is the number of peaks we examined (n = 7). ρ can take values between +1 and −1, where ρ = +1 indicates perfect rank correlation, ρ = −1 indicates perfect negative rank correlation, and ρ with values near 0 indicates that there is very weak, or non existent, correlation between the sets of variables examined. We also examined how well we predicted each of the ranks 1,...,7. Since it is often difficult to make exact rank predictions, we examined the proportion of cases in which our predictions where close to the observed, with the predicted rank p being at most 3 positions away from the observed rank r (i.e., r − 3 ≤ p ≤ r + 3).
There are a few general trends we can observe in the table. First, as evident by the positive values of ρ, with all charge states and peptide lengths examined, our predictions have strong positive correlations with the fragment ranks observed in the experimental spectra. These correlations are strongest with the shorter peptides with charges 1 and 2, where our rank predictions are most accurate. This poorer results with triply-charged peptides are due to the more complex fragmentation processes involved with multiply-charged peptides which also often results in poorer fragmentation patterns that are difficult to predict. Examining the rest of the table's columns reveals that the predictions for stronger peaks are more accurate than weaker peaks (this is evident in all cases were the accuracy deteriorates as we move from rank 1 to rank 7). In addition, the peak rank predictions with shorter peptides are generally more accurate than the ones done with longer peptides. However, the decline in prediction accuracy is not that severe. For instance, there is only a 5.6% reduction in the accuracy of the prediction of the strongest peak when go from doubly-charged peptides of length 10 (92.9% correct) to peptides of length 20 (87.3% correct); even though the number of peak ranks that need to be predicted is doubled.
In this paper we explored how ranking algorithms can be used to predict the ranks of fragment ion peaks in a peptide's experimental mass spectrum based solely on the peptide's sequence (the peak rank prediction problem). This is not a trivial task. Peptide fragmentation is a complex process that involves many competing chemical pathways. It is quite difficult to create generative probabilistic models for mass spectra that consider this wide range of chemical processes. Yet generative models are not required to solve the peak rank prediction problem. The structure of this problem lends itself nicely to a ranking-based solution since the required output is an ordering of instances (we want to predict a permutation for fragment ions, from strongest to weakest). We based our solution to this problem on ranking-based discriminative models; created using a pool of simple sequence-based features. These features get combined by the RankBoost algorithm into strong discriminative models, using a training procedure that is optimized to minimize the number of pairs of peaks on which the model makes ordering mistakes. What makes this approach feasible are the large amounts of MS/MS training data that have become available (we used approximately 320000 unique peptide-spectrum pairs), which enabled us to create detailed models with minimal overfitting.
As opposed to many “black box” machine learning algorithms, like neural networks, the RankBoost algorithm is relatively transparent, and allows us to observe the dynamics involved in creating the models, and later on, the scoring of new instances. We are able to examine the contribution of single features, and in many cases understand the logic that guided the scoring of different instances. The high-scoring features in our models reflect known chemical pathways in peptide fragmentation. Since the CID fragmentation method is well studied, we did not find evidence of novel fragmentation pathways that were not previously described in the literature. However, it is likely that given training data that was generated using a fragmentation method that has not been studied as extensively as CID, our models could supply some interesting insights into the dynamics of the peptide fragmentation.
Our experimental results show that our peak rank predictions are fairly accurate, and can therefore be helpful in detecting correct peptide-spectrum matches. In an accompanying manuscript,30 we incorporate our peak rank prediction models into a scoring function for peptide-spectrum matches that greatly improves the performance of both de novo and database search algorithms. The models we described can also be useful for multiple reaction monitoring (MRM) experiments,31 where they can assist in the selection of transitions in cases where there is no experimental MS/MS data to identify a peptide's strongest fragments.
Our models’ predictions for peak ranks in triply-charged peptides were slightly inferior to the predictions for singly and doubly-charged peptides. This is due to the fact that the dynamics of the fragmentation pathways in triply-charged peptides are more difficult to predict (these peptides are typically longer and contain more basic amino acids). This underscores the need for more complex features that can improve the modeling of such cases (e.g., features that simultaneously describe the location of several amino acids). Another possible way to increase the models’ accuracy is to incorporate into RankBoost a more powerful and expressive learning algorithm (instead of the regular boosting). Alternating decision trees63 might be good choice since they can easily account for relationships and dependencies between simple features and might provide a more accurate model of the complex interactions between fragmentation pathways.
Accurate peak models become essential when trying to identify large peptides with high charge states. CID can reach charges +4 and +5, though spectra with these charges are not identified often. ETD64 spectra routinely posses even higher charges. Top-down mass spectra,65 record the fragmentation of whole proteins and can possess precursor charges that reach dozens and even hundreds of protons. When such large peptides (or proteins) are fragmented, they often undergo several fragmentation events, giving rise to many internal fragments. Often, the internal fragments end up accounting for most of the spectrum's intensity. Since there is a quadratic number of possible internal ions, it becomes especially important for scoring functions to be able to predict which are the few internal fragments that are likely to be observed.
I am grateful to Pavel Pevzner for many inspiring discussions. This research was supported by the “National Center for Research Resources” of the NIH via grant P-41-RR24851. I would also like to acknowledge the UCSD FWGrid Project for the availability of their computational infrastructure. The UCSD FWGrid Project is funded in part by NSF Research Infrastructure Grant number NSF EIA-0303622.