Proteins are created through a biological process called protein biosynthesis. This process begins with transcription and splicing of genes into messenger RNA (mRNA) molecules, which are later translated into polypeptides. At the time of translation, a protein can either be active or inactive, and its subsequent activity is generally regulated by chemical modifications referred to as post-translational modifications (PTMs). PTMs, which may occur during or after translation, involve an enzymatic addition of a chemical group (e.g. a phosphate) or a larger moiety (e.g. an additional polypeptide such as ubiquitin) onto one or more amino acid side chains. Many PTMs, in particular, phosphorylation on serine (S), threonine (T) or tyrosine (Y), can regulate a protein's function by influencing its folding, stability or physical association with other proteins, thereby activating or suppressing it.
Reviews of protein mass spectrometry and the detection of PTM by mass spectrometry can be found in Domon and Aebersold (2006
) and Witze et al. (2007
). Briefly, a typical proteomic MS/MS experiment begins with an enzymatic digestion of proteins into peptides. For a complex mixture, it is common to simplify the mixture by separating peptides based on their chemical properties, such as hydrophobicity using liquid chromatography, before they are ionized and injected into a mass spectrometer. Once in the spectrometer, peptides are grouped and isolated based on their mass-to-charge ratio (simply referred to as mass). Ideally, each group will contain one peptide variant. For each group, the peptides are further broken down by a fragmentation method, such as collision-induced dissociation (CID), to produce ion fragments. The mass spectrometer captures the ion fragmentation pattern for each group in a mass spectrum. The presence of PTMs shifts the mass of the corresponding ions, which changes the ion fragmentation pattern significantly. Specialized PTM search engines, like those discussed above, are used to decipher these ion patterns and map each mass spectrum to a peptide sequence that best explains it.
The use of real and decoy proteins is an established practice for estimating false predictions for MS/MS spectra analysis using database search approaches [Kall et al. (2007
); Kislinger et al. (2003
); Peng et al. (2003
)], including blind PTM search methods (Liu et al., 2008
). Decoy proteins are generated by reversing the amino acid sequence of real proteins; this ensures that real and decoy proteins have the same distributions of amino acids and protein lengths. When using a protein reference database containing an equal number of real and decoy proteins, a random (false) peptide prediction (modified or unmodified) will have an equal likelihood of choosing a peptide from either a real protein or a decoy protein. This allows the number of decoy peptide hits as an estimate for false detection rate.
In practice, blind PTM search methods suffer from two major sources of error: sequence-dependent uncertainty in the modification position (residue position along the peptide sequence where the modification is deemed to occur) and mass inaccuracy for the modification mass. The fragmentation process is often incomplete and the presence of labile PTMs may interfere with this process (Mikesh et al., 2006
). Both issues combined result in MS/MS spectra missing peaks that in turn may lead to ambiguous or erroneous modification predictions. The presences of natural stable isotopes, such as carbon-13, in addition to electronic noise are major contributors to inaccurate mass measurements. This is more prominent in spectra generated from low mass resolution mass spectrometers (e.g. ion trap mass spectrometers), which are still commonly used in today's mass spectrometry studies. shows a diagrammatic representation of the search results obtained from applying the blind PTM search engine SIMS (Liu et al., 2008
) to a set of MS/MS spectra previously mapped to phosphopeptides (Beausoleil et al., 2004
). Enriched for phosphopeptides using a strong cation exchange-based method, the spectra from the complex peptide mixture in the original study were analyzed by a restricted PTM search method designed to look for phosphorylation and were validated manually. The same dataset has been used in benchmark experiments in previous PTM studies [Liu et al. (2008
); Tanner et al. (2005
)]. Phosphorylation is known to occur at ~80 Da and primarily on the amino acid serine (S) and less frequently on threonine (T). Yet the results show, even for those observed peptide sequences that match to the original reference study, that many of their modification (delta) masses and modified amino acid sites deviate from this reference. A closer look (C) shows that many of the amino acids misplaced are a few residues away from their corresponding reference modification position. In a global-scale PTM survey, these issues can make distinguishing true PTM matches from false detections non-trivial; therefore, identifying bona fide
PTMs confidently remains difficult. While these errors can potentially be reduced by technological improvement in instrumentation (e.g. higher mass accuracy mass spectrometers or using alternate fragmentation mechanisms), we sought to develop an algorithm that can deconvolve errors associated with measuring masses and mapping of modification positions simultaneously to salvage both existing datasets and current experimental platforms.
These two sources of error were acknowledged by Tsur et al. (2005
), who briefly described a heuristic approach to account for ‘shadows’ (modifications that are misplaced by a PTM search engine), and later by the same group in the PTM refinement algorithm PTMFinder (Tanner et al., 2008
). However, in the former method, their approach favors high abundance modified peptides, since it requires each peptide match to occur multiple times; discretizes observed modification masses, which introduces additional error with the mass measurements; and can handle only one type of error per peptide (namely either a modification mass error of exactly 1 Da or a modification position misplaced by exactly one residue). The latter method, PTMFinder, takes a machine learning approach where it groups and reanalyzes spectra mapping to the same modified peptide sequence to produce for each spectrum a final peptide sequence with a modification mass and a modification position. This method also suffers from favoring high abundance modified peptides and discretizing observed modification masses. As we show later, it is not always the case that many spectra map to the same modified peptide in a typical genome-wide MS study. These restrictions limit the suitability of both methods' error correction approach to global PTM studies. We note that in addition to correcting for errors with modification mass and modification position estimations, PTMFinder can refine the peptide sequence and provides a P
-value confidence score for both the reported peptide sequence and modification.
Here we introduce a novel generative probability model (PTMClust) that addresses the aforementioned problems encountered when using blind PTM search engines. It accomplishes a significant boost in PTM prediction accuracy and precision by modeling the hidden relationships between the compositions of amino acids in the peptide sequence, specifically the modification mass, the modification position and the identity of the modified amino acid. Our algorithm iterates between clustering modified peptides with similar modifications to form groups, which we call PTM groups, and finding the most likely modification mass and modification position for each peptide based on the grouping. Our method distinguishes itself from others by modeling modifications at the PTM level instead of at the individual peptide level. By rigorous benchmarking, we show that a number of learned PTM groups correspond to known PTMs and many reported modified peptides match to annotated modifications. In addition, our algorithm simultaneously considers PTMs occurring in either the middle or at the terminal ends of a peptide or protein, which provides additional information missed by blind PTM search techniques [Han et al. (2005
); Liu et al. (2006
); Searle et al. (2006
); Tanner et al. (2005
); Tsur et al. (2005
)]. To ensure broad applicability, we have designed and optimized PTMClust to analyze PTM data generated from low resolution MS/MS spectra processed by popular blind PTM search engines, such as those generated from ion trap mass spectrometers.