Protein phosphorylation is regarded as a key mechanism for the regulation of many cellular processes, including metabolism, cell division and apoptosis (Cohen, 2000). It has been estimated that >50% of expressed proteins are phosphorylated at some point in their life cycle (Hjerrild and Gammeltoft, 2006), though only a small fraction of the potential phosphorylation sites have been identified.
In recent years, the examination of complex protein mixtures by tandem mass spectrometry (MS/MS) has become feasible through advances in instrumentation and computational methodologies (Cox and Mann, 2007). Peptide and protein analysis at the cell extract level has become an almost routine procedure, as algorithms such as MASCOT (Perkins et al.) and SEQUEST (Yates et al.), among others, allow rapid identification of proteins by matching tandem mass spectra to sequence databases.
Statistical methods have been developed to assess the validity of peptide observations. Search strategies such as the inclusion of reversed or scrambled sequences in the database can give an estimate of the likely accuracy, or false discovery rate (FDR), of peptide identifications as a function of the database search engine score for a particular experiment (Elias and Gygi, 2007; Kall et al.). This allows a reasonably robust description of the protein species in a particular sample. Further work has been performed to match peptide fragmentation patterns to peptide sequences through machine learning techniques such as hybrid support vector machines/Bayesian networks (Klammer et al.) and decision trees (Elias et al.), but these are not yet readily applicable to peptides containing post-translational modifications (PTMs).
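The decoy-database idea described above can be sketched in a few lines. This is a minimal illustration, not the procedure of any cited paper: it assumes each peptide-spectrum match is a hypothetical `(score, is_decoy)` pair and uses the simple ratio of decoy hits to target hits above a score threshold as the FDR estimate. Other estimators exist, and the function names are invented for this example.

```python
def estimate_fdr(psms, threshold):
    """Estimate FDR at a score threshold as (#decoy hits) / (#target hits).

    psms: list of (score, is_decoy) tuples (a hypothetical input format).
    """
    targets = sum(1 for score, is_decoy in psms
                  if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms
                 if score >= threshold and is_decoy)
    if targets == 0:
        return 0.0  # nothing accepted, so nothing falsely discovered
    return min(1.0, decoys / targets)


def threshold_for_fdr(psms, max_fdr):
    """Return the lowest observed score whose estimated FDR is <= max_fdr.

    Returns None if no observed score satisfies the limit.
    """
    best = None
    # Walk candidate thresholds from the highest score downwards.
    for t in sorted({score for score, _ in psms}, reverse=True):
        if estimate_fdr(psms, t) <= max_fdr:
            best = t
    return best


if __name__ == "__main__":
    # Toy data: four target hits and two decoy hits.
    psms = [(90, False), (80, False), (70, True),
            (60, False), (50, False), (40, True)]
    print(estimate_fdr(psms, 50))
    print(threshold_for_fdr(psms, 0.25))
```

Because the estimate depends on how scores are distributed between target and decoy hits, it only holds for the experiment and search engine from which the scores were taken.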
Although protein and peptide analysis is now a mainstream high-throughput technique, reliable identification of PTMs remains a specialist area. Algorithms designed for peptide database matching can take PTMs into account but are not designed to provide robust identifications. PTMs are often present only in low stoichiometry, and ion signals corresponding to phosphopeptides tend to be suppressed in the presence of non-phosphorylated peptides. The low stoichiometry of PTMs requires the use of peptide enrichment approaches such as affinity pull down (Gronborg et al.), immobilized metal affinity chromatography (Andersson and Porath, 1986; Stensballe et al.) or titanium dioxide columns (Pinkse et al.) to enrich the phosphopeptide complement of a complex mixture.
The peptide enrichment strategies required for PTM identification in complex mixtures compromise the scoring assumptions upon which generic database search strategies are based, rendering it unreliable to take the database search scores as a surrogate for PTM identification accuracy. Consequently, confident identification of PTMs requires the manual inspection of the raw MS/MS spectra by a competent mass spectrometrist, a time-consuming step that acts as a major bottleneck in global PTM analyses where many thousands of spectra may require analysis. This examination requires interpretation of the fragmentation pattern in line with the experience of the scientist, and comparison of multiple interpretations of the data, each of which could be correct if the sample contains multiple isobaric isoforms of the phosphopeptide.
An example is shown in Figure 1, where an experienced mass spectrometrist identifies the phosphoTyrosine (Rank 2 hit) as a better interpretation than the phosphoSerine (top-ranked hit). Full Prophossi reports for these two peptide–spectrum matches (PSMs) are available in the Supplementary Material.
Fig. 1. An example of misidentification of the correct phosphorylation site. MASCOT identifies the phosphorylation site as pS16 (peptide score 118), though the expected y5-98 ion is much weaker than the weak y5 ion (blue). The second ranked hit (pY17, score 102) ...
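The comparison of y5 with y5-98 in the caption rests on the neutral loss of phosphoric acid (about 98 Da) from a fragment ion. A minimal sketch of that check follows, assuming a peak list of hypothetical `(m/z, intensity)` pairs and a fixed m/z tolerance; the function names, the tolerance and the toy peak values are illustrative only and are not taken from the software described here.

```python
H3PO4_MASS = 97.97690  # monoisotopic mass of phosphoric acid in Da


def neutral_loss_mz(fragment_mz, charge=1, loss_mass=H3PO4_MASS):
    """m/z expected for a fragment ion after losing a neutral of loss_mass."""
    return fragment_mz - loss_mass / charge


def peak_intensity(peaks, target_mz, tol=0.5):
    """Intensity of the strongest peak within tol of target_mz, else 0.0."""
    hits = [inten for mz, inten in peaks if abs(mz - target_mz) <= tol]
    return max(hits) if hits else 0.0


def neutral_loss_evidence(peaks, fragment_mz, charge=1, tol=0.5):
    """Return (intact_intensity, loss_intensity) for one fragment ion."""
    intact = peak_intensity(peaks, fragment_mz, tol)
    loss = peak_intensity(peaks, neutral_loss_mz(fragment_mz, charge), tol)
    return intact, loss


if __name__ == "__main__":
    # Toy spectrum: a weak intact fragment and a strong -98 partner.
    peaks = [(602.0, 850.0), (700.0, 120.0)]
    print(neutral_loss_evidence(peaks, 700.0))
```

A fragment that carries the phosphate on serine or threonine would be expected to show a noticeable loss peak, so a loss peak much weaker than the intact ion argues against that site assignment.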
This issue has been approached by two other groups, both of whom apply a post-processing step to standard peptide MS/MS search engine results. Beausoleil et al. report a probabilistic method that calculates an Ascore based on the appearance of phosphorylation sites in multiple candidate solutions to the spectrum-to-database matching problem. They calculate a probability score based on the appearance of site-determining ions, i.e. those fragment ions that would be specific for a particular phosphopeptide isoform. The method depends on the SEQUEST search engine to identify the two top phospho-isoform hits, and it reports quality data only for the best phospho-isoform hit. It does not consider the possibility of isobaric phospho-isoform mixtures. The closed-source implementation of the software prevents user optimization of the search parameters. Smith et al. have taken a different approach which is, in some respects, similar to ours. They examine the daughter ion spectrum and assign scores based on a limited range of spectral features. Peptide matches failing to reach a defined score threshold are rejected. Both of the aforementioned methods have some limitations. The Beausoleil method ignores spectral features beyond the site-determining ions and does not consider neutral loss of phosphate from phosphoSerine and phosphoThreonine, resulting in a smaller number of phosphopeptide spectrum matches. The Smith method assigns scores from a limited range of features, which alone may not be sufficient to specifically locate a phosphorylation site, and has not been tested empirically.
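The notion of site-determining ions can be made concrete with a short sketch. For two candidate isoforms of the same peptide that differ only in which residue carries the phosphate, a b or y ion is site-determining when it contains exactly one of the two candidate residues, so that its mass differs between the isoforms. The function below is an illustration under that definition, with 1-based residue positions; it is not the Ascore implementation, and its name and return format are invented for this example.

```python
def site_determining_ions(length, site_a, site_b):
    """List the b and y ion indices that distinguish two phospho-isoforms.

    length: number of residues in the peptide.
    site_a, site_b: 1-based positions of the two candidate phosphosites.
    Returns {"b": [...], "y": [...]} where b_k holds residues 1..k and
    y_k holds the last k residues.
    """
    if site_a == site_b:
        raise ValueError("candidate sites must differ")
    lo, hi = sorted((site_a, site_b))
    if lo < 1 or hi > length:
        raise ValueError("sites must lie within the peptide")
    # b_k contains lo but not hi when lo <= k <= hi - 1.
    b_ions = list(range(lo, hi))
    # y_k contains hi but not lo when length - hi + 1 <= k <= length - lo.
    y_ions = list(range(length - hi + 1, length - lo + 1))
    return {"b": b_ions, "y": y_ions}


if __name__ == "__main__":
    # Hypothetical 10-residue peptide with candidate sites at 3 and 6.
    print(site_determining_ions(10, 3, 6))
```

When the candidate sites are adjacent, only one b ion and one y ion separate the isoforms, which is why additional evidence such as neutral-loss peaks can matter for localization.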
Our methodology, described in this article, incorporates a broader range of spectral features and seeks to identify evidence for the specific localization of the putative phosphorylation sites. Thus, every PSM, i.e. every hit from a MASCOT or other search engine search, is assessed on its own merit. The method is, therefore, able to interpret complex spectra derived from multiple isobaric phosphopeptide species. Opinions from three experienced mass spectrometrists were used to derive a set of chemistry-based criteria that could be applied to tandem mass spectra for selection between, and validation of, the database PSM search hits. This method does not perform database searches itself but provides a report on how well the observed spectrum fits to the predicted matches, and whether the predicted match passes these analytical criteria. As such, it can be applied to the results of any such database search.
Typically, relatively few phosphopeptide spectra are observed in proteomics experiments in the absence of specific phosphopeptide enrichment protocols, and this small number can be examined by hand by an experienced analyst. However, when such enrichment methods are applied, as in the experiments described here, the proportion of spectra arising from phosphopeptides rapidly expands to a level where automated processing tools are a practical necessity. Our aim was to develop tools that automate rapid processing of large numbers of spectra with few falsely identified phosphorylation sites (high selectivity) and sufficiently good sensitivity to provide significant coverage. As we are examining proteomes where little is known about the existing phosphorylation state of the organism, a tool that rapidly and confidently assigns the majority of easy cases is a considerable boost to productivity. All database hits can be assessed, and positive results reported. As all the criteria can be explicitly described in English, marginal hits can also be examined rapidly by an experienced analyst with an appropriate visualization tool. Additionally, a full text report that highlights salient features and annotates the spectrum can be generated. This approach has been validated through assessment of the automated annotation of the Trypanosoma brucei phosphoproteome (Nett et al.). We manually examined all identified hits for a specific family of proteins (the protein kinases) and examined the error rate and bias in our automated processing. Our method runs rapidly, allows the assessment of more than just the top hit and gives excellent selectivity with good sensitivity. Output can be via an annotated spectrum and report, produced in HTML or PDF, or via a software application programming interface, allowing integration of the analysis in a high-throughput analysis pipeline. Several of our predictions of occupied T. brucei phosphoTyrosine sites have been validated experimentally by both western blot and immunofluorescence microscopy experiments using two well-characterized anti-phosphoTyrosine antibodies (Nett et al.