We have presented Riptide, which models peptide fragmentation chemistry using a collection of DBNs trained from high-quality PSMs. Riptide can provide insights into fragmentation biochemistry, and feature vectors produced by Riptide can be used as input to further machine learning algorithms to improve peptide identification.
The Riptide model generalizes well across PSMs from different organisms: we trained our model on PSMs from E. coli and tested it on PSMs from the yeast Saccharomyces cerevisiae. This good generalization is aided by the DBN machinery's ability to control model complexity through switching parents, which dramatically reduces the number of trainable parameters. It is unlikely that a model taking into account, for example, C-term and N-term flanking amino acids could be trained on a few thousand spectra without some analogous parameter-reduction machinery.
Of course, Riptide will likely not generalize well across all types of MS/MS peptide fragmentation data. For example, using different methods of activating peptide ions, such as electron transfer dissociation (ETD) (Mikesh et al., 2007) or electron capture dissociation (ECD) (Zubarev, 2004), would likely require retraining the model. Furthermore, very long or very short peptides (as noted in Section 4.1) may also exhibit different chemistries that subvert the Riptide model. However, one of the benefits of the learning approach used here is that Riptide is not static and can improve as data improve and as technology and protocols change. For example, in this study we focused on fragmentation of tryptic peptides of charge state +2, because these are the most common peptides in the samples we analyze with collision-induced dissociation. But different samples generated from different proteases or analyzed with different fragmentation technologies could be used to train the Riptide models. A related advantage of the machine learning approach is that new DBNs can be applied to arbitrary ion series. In this work, we focused on collision-induced dissociation fragmentation spectra from +2 peptides. An obvious extension would be to apply the DBNs to different charge states, such as +1 and +3 or higher. Also, ETD and ECD have been shown to be useful in proteomics, but they predominantly produce c- and z-ions, rather than b- and y-ions. Given appropriate training data, Riptide could learn fragmentation patterns from these ion series.
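To make the notion of an ion series concrete, the following is a minimal sketch of how theoretical singly charged fragment m/z values might be enumerated for the b, y, c and z series. It is not part of Riptide: the function name `fragment_mz` is hypothetical, the masses are standard monoisotopic values, and the z series is approximated as the y ion minus ammonia (the radical z-type ion commonly observed in ETD is roughly one hydrogen heavier).

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276
WATER = 18.010565
AMMONIA = 17.026549


def fragment_mz(peptide, series):
    """Return singly charged m/z values for one ion series.

    Entry k-1 of the result corresponds to cleavage after residue k:
    prefix ions (b, c) contain residues 1..k, suffix ions (y, z)
    contain residues k+1..n. Hypothetical helper, not Riptide code.
    """
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    result = []
    for k in range(1, len(peptide)):
        prefix = sum(masses[:k])
        suffix = sum(masses[k:])
        if series == "b":
            mz = prefix + PROTON
        elif series == "y":
            mz = suffix + WATER + PROTON
        elif series == "c":
            mz = prefix + AMMONIA + PROTON
        elif series == "z":
            # Simplifying assumption: z taken as y minus NH3.
            mz = suffix + WATER - AMMONIA + PROTON
        else:
            raise ValueError(f"unknown ion series: {series!r}")
        result.append(mz)
    return result
```

Under these assumptions, switching from collision-induced dissociation to ETD or ECD changes only which series are enumerated; the fragmentation model itself would still need to be retrained on matching data.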
In a sense, the two overall goals of Riptide (learning about peptide fragmentation biochemistry and improving our ability to identify spectra) are at odds with each other. This tension mirrors the fact that, in general, DBNs admit two different methods of parameter training. On the one hand, there is generative training, in which the objective function is optimized when the corresponding joint probability distribution best describes the data. As a simple example, given a DBN representation of the joint distribution of intensity and peptides Pr(i, p|θ), where θ are the model parameters, generative training adjusts θ so that this joint distribution is as accurate as possible. Discriminative training, on the other hand, adjusts the parameters of the model so that classification accuracy is as high as possible. For example, using Bayes' rule, we can form the posterior Pr(p|i, θ) = Pr(i, p|θ)/Pr(i|θ) and then choose the p that maximizes this posterior. Adjusting the parameters θ to minimize the error rate of the resulting Bayes decision rule would constitute discriminative training. Generative training is computationally cheap relative to discriminative training. Therefore, in this work we have simulated a discriminative training procedure by explicitly training positive and negative models separately. This choice was also motivated by the desire to obtain interpretable probabilistic parameters, which a model trained solely on positive PSMs allows. In future work, we plan to experiment with a fully discriminative Riptide model for peptide identification and a separate, fully generative model for investigating fragmentation phenomena.
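The Bayes decision rule described above can be illustrated with a toy example. This is a sketch under strong simplifying assumptions: the joint distribution Pr(i, p|θ) is given as an explicit lookup table rather than a DBN, and the names `posterior`, `bayes_decision` and `error_rate` are hypothetical. Discriminative training would adjust θ to lower `error_rate`, whereas generative training would instead maximize the likelihood of the observed (i, p) pairs.

```python
def posterior(joint, spectrum):
    """Pr(p | i, theta) = Pr(i, p | theta) / Pr(i | theta).

    joint maps (spectrum_id, peptide) -> joint probability.
    """
    # Marginal Pr(i | theta): sum the joint over all peptides.
    evidence = sum(pr for (i, _p), pr in joint.items() if i == spectrum)
    if evidence == 0:
        raise ValueError("spectrum has zero probability under the model")
    return {p: pr / evidence for (i, p), pr in joint.items() if i == spectrum}


def bayes_decision(joint, spectrum):
    """Choose the peptide that maximizes the posterior."""
    post = posterior(joint, spectrum)
    return max(post, key=post.get)


def error_rate(joint, labelled):
    """Fraction of (spectrum, true_peptide) pairs that are misclassified."""
    wrong = sum(1 for i, p in labelled if bayes_decision(joint, i) != p)
    return wrong / len(labelled)


# Toy joint distribution over two spectra and two candidate peptides.
toy_joint = {
    ("s1", "PEPA"): 0.30, ("s1", "PEPB"): 0.10,
    ("s2", "PEPA"): 0.15, ("s2", "PEPB"): 0.45,
}
```

With this table, `bayes_decision(toy_joint, "s1")` selects `"PEPA"` and `bayes_decision(toy_joint, "s2")` selects `"PEPB"`, because each has the larger posterior for its spectrum.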
Although Riptide is reasonably fast in absolute terms (on the order of a minute per spectrum for the databases considered here), it is slow compared to other commonly used PSM evaluation metrics, such as Xcorr. This is tolerable, because MS/MS analysis software has long used fast preliminary scores to pre-filter peptides before handing them off to sensitive, yet expensive, final scoring routines. The running time for Riptide to score a given spectrum scales approximately as $O(l \, N_p \, N_i \log N_s)$, where $l$ is the average length of a peptide, $N_p$ is the number of candidate peptides for that spectrum, $N_i$ is the number of ion series under consideration, and $N_s$ is the number of peaks in the particular spectrum.
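One way to see where a bound of this shape can come from is the loop structure below. This is a sketch only and does not reproduce Riptide's DBN inference: it assumes, purely for illustration, that the logarithmic factor arises from a binary search for each theoretical fragment in a sorted peak list. The names `nearest_peak` and `count_matches` and the tolerance value are hypothetical.

```python
from bisect import bisect_left


def nearest_peak(sorted_mz, target, tol):
    """Index of the closest peak within tol of target, or None.

    Binary search costs O(log N_s) for N_s sorted peaks.
    """
    j = bisect_left(sorted_mz, target)
    best = None
    # Only the neighbours on either side of the insertion point can be closest.
    for idx in (j - 1, j):
        if 0 <= idx < len(sorted_mz):
            dist = abs(sorted_mz[idx] - target)
            if dist <= tol and (best is None or dist < abs(sorted_mz[best] - target)):
                best = idx
    return best


def count_matches(peak_mz, candidates, tol=0.5):
    """Count matched fragments per candidate peptide.

    candidates maps peptide -> {ion_series: [theoretical m/z, ...]}.
    Returns (matches per peptide, total number of peak lookups).
    """
    sorted_mz = sorted(peak_mz)
    lookups = 0
    matches = {}
    for peptide, by_series in candidates.items():   # N_p candidates
        hits = 0
        for _series, fragments in by_series.items():  # N_i ion series
            for mz in fragments:                       # about l fragments
                lookups += 1                           # each is O(log N_s)
                if nearest_peak(sorted_mz, mz, tol) is not None:
                    hits += 1
        matches[peptide] = hits
    return matches, lookups
```

The number of lookups grows as the product of candidates, ion series and fragments per series, which is why a fast pre-filter that shrinks the candidate set before the expensive scoring step reduces the total cost proportionally.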
Currently, Riptide is implemented in a combination of C++ and Python code, using the GMTK package for dynamic Bayesian network analysis. GMTK is freely available, and the C++ code is available from the authors upon request. In the near future, we plan to migrate Riptide to C and integrate the code into the sequence database search package Crux (C.Y. Park et al., in press). Ultimately, the Crux package will incorporate the probabilities produced by Riptide for PSMs into probabilities for protein identification.