Phosphorylation site assignment of high-throughput tandem mass spectrometry (LC-MS/MS) data is one of the most common and critical aspects of phosphoproteomics. Correctly assigning phosphorylated residues helps us understand their biological significance. Common search algorithms (such as Sequest and Mascot) are not designed to perform site assignment; additional algorithms are therefore essential to assign phosphorylation sites for mass spectrometry data. The main contribution of this study is the design and implementation of a linear-time, linear-space dynamic programming strategy for phosphorylation site assignment, referred to as PhosSA. The proposed algorithm uses the summation of peak intensities associated with theoretical spectra as an objective function. Quality control of the assigned sites is achieved using a post-processing redundancy criterion that indicates the signal-to-noise ratio properties of the fragmented spectra. The quality of the algorithm was assessed on experimentally generated data sets from synthetic peptides for which phosphorylation sites were known. We report that PhosSA was able to achieve a high degree of accuracy and sensitivity with all the experimentally generated mass spectrometry data sets. The implemented algorithm is shown to be extremely fast and scalable with increasing number of spectra (we report up to 0.5 million spectra/hour on a moderate workstation). The algorithm is designed to accept results from both Sequest and Mascot search engines. An executable is freely available at http://helixweb.nih.gov/ESBL/PhosSA/ for academic research purposes.
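To make the objective function concrete, the following is a minimal sketch of scoring candidate sites by summed matched peak intensity. It is not the PhosSA implementation: it enumerates candidate sites by brute force rather than by dynamic programming, and it assumes a single phosphate group, singly charged b and y ions only, and a fragment tolerance of 0.5 Da. The function names (`fragment_mzs`, `site_score`, `assign_site`) are hypothetical.

```python
# Standard monoisotopic residue masses (Da).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PHOSPHO = 79.96633   # HPO3 added by phosphorylation
PROTON = 1.007276
WATER = 18.010565


def fragment_mzs(peptide, site):
    """Singly charged b- and y-ion m/z values with the phosphate at index `site`."""
    masses = [RESIDUE_MASS[aa] + (PHOSPHO if i == site else 0.0)
              for i, aa in enumerate(peptide)]
    mzs = []
    for cut in range(1, len(peptide)):
        mzs.append(sum(masses[:cut]) + PROTON)          # b ion
        mzs.append(sum(masses[cut:]) + WATER + PROTON)  # y ion
    return mzs


def site_score(peptide, site, peaks, tol=0.5):
    """Sum the intensities of observed peaks within `tol` Da of any theoretical fragment."""
    theoretical = fragment_mzs(peptide, site)
    return sum(intensity for mz, intensity in peaks
               if any(abs(mz - t) <= tol for t in theoretical))


def assign_site(peptide, peaks, tol=0.5):
    """Return the S/T/Y index with the highest score, or None if there is no candidate."""
    candidates = [i for i, aa in enumerate(peptide) if aa in "STY"]
    if not candidates:
        return None
    return max(candidates, key=lambda i: site_score(peptide, i, peaks, tol))
```

Here `peaks` is a list of `(m/z, intensity)` pairs. Exhaustive enumeration grows combinatorially once several phosphate groups are allowed, which is the cost a dynamic programming formulation is meant to avoid.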
Tandem mass spectrometry (MS/MS) is a widely used method for proteome-wide analysis of protein expression and post-translational modifications (PTMs). The thousands of MS/MS spectra produced from a single experiment pose a major challenge for downstream analysis. Standard programs, such as Mascot, provide peptide assignments for many of the spectra, including identification of PTM sites, but these results are plagued by false positive identifications. In phosphoproteomics experiments only a single peptide assignment is typically available to support identification of each phosphorylation site, so minimizing false positives is critical. Thus, tedious manual validation is often required to increase confidence in the spectral assignments.
We have developed phoMSVal, an open-source platform for managing MS/MS data and automatically validating identified phosphopeptides. We tested five classification algorithms with 17 extracted features to separate correct peptide assignments from incorrect ones using over 3000 manually curated spectra. The naive Bayes algorithm was among the best classifiers with an area under the ROC curve value of 97% and positive predictive value of 97% for phosphotyrosine data. This classifier required only three features to achieve a 76% decrease in false positives as compared to Mascot while retaining 97% of true positives. This algorithm was able to classify an independent phosphoserine/threonine dataset with area under ROC curve value of 93% and positive predictive value of 91%, demonstrating the applicability of this method for all types of phospho-MS/MS data. PhoMSVal is available at http://csbi.ltdk.helsinki.fi/phomsval
bioinformatics; data management; feature selection; machine learning; phosphoproteomics
Motivation: The 14-3-3 family of phosphoprotein-binding proteins regulates many cellular processes by docking onto pairs of phosphorylated Ser and Thr residues in a constellation of intracellular targets. Therefore, there is a pressing need to develop new prediction methods that use an updated set of 14-3-3-binding motifs for the identification of new 14-3-3 targets and to prioritize the downstream analysis of >2000 potential interactors identified in high-throughput experiments.
Results: Here, a comprehensive set of 14-3-3-binding targets from the literature was used to develop 14-3-3-binding phosphosite predictors. Position-specific scoring matrix, support vector machines (SVM) and artificial neural network (ANN) classification methods were trained to discriminate experimentally determined 14-3-3-binding motifs from non-binding phosphopeptides. ANN, position-specific scoring matrix and SVM methods showed best performance for a motif window spanning from −6 to +4 around the binding phosphosite, achieving Matthews correlation coefficient of up to 0.60. Blind prediction showed that all three methods outperform two popular 14-3-3-binding site predictors, Scansite and ELM. The new methods were used for prediction of 14-3-3-binding phosphosites in the human proteome. Experimental analysis of high-scoring predictions in the FAM122A and FAM122B proteins confirms the predictions and suggests the new 14-3-3-predictors will be generally useful.
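For readers unfamiliar with the Matthews correlation coefficient used above as the performance measure, the sketch below computes it from confusion-matrix counts using the standard definition. The function name and the convention of returning 0.0 when the denominator vanishes are choices made for this illustration, not details taken from the study.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Common convention when a row or column of the confusion matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom
```

The coefficient ranges from -1 (complete disagreement) through 0 (no better than chance) to +1 (perfect prediction), and it remains informative when binding and non-binding examples are unevenly represented.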
Availability and implementation: A standalone prediction web server is available at http://www.compbio.dundee.ac.uk/1433pred. Human candidate 14-3-3-binding phosphosites were integrated in ANIA: ANnotation and Integrated Analysis of the 14-3-3 interactome database.
Supplementary data are available at Bioinformatics online.
Automated database search engines are one of the fundamental engines of high-throughput proteomics enabling daily identifications of hundreds of thousands of peptides and proteins from tandem mass (MS/MS) spectrometry data. Nevertheless, this automation also makes it humanly impossible to manually validate the vast lists of resulting identifications from such high-throughput searches. This challenge is usually addressed by using a Target-Decoy Approach (TDA) to impose an empirical False Discovery Rate (FDR) at a pre-determined threshold x% with the expectation that at most x% of the returned identifications would be false positives. But despite the fundamental importance of FDR estimates in ensuring the utility of large lists of identifications, there is surprisingly little consensus on exactly how TDA should be applied to minimize the chances of biased FDR estimates. In fact, since less rigorous TDA/FDR estimates tend to result in more identifications (at higher 'true' FDR), there is often little incentive to enforce strict TDA/FDR procedures in studies where the major metric of success is the size of the list of identifications and there are no follow up studies imposing hard cost constraints on the number of reported false positives.
Here we address the problem of the accuracy of TDA estimates of empirical FDR. Using MS/MS spectra from samples where we were able to define a factual FDR estimator of 'true' FDR, we evaluate several popular variants of the TDA procedure in a variety of database search contexts. We show that the fraction of false identifications can sometimes be over 10× higher than reported and may be unavoidably high for certain types of searches. We further report that the two-pass search strategy seems the most promising database search strategy.
While unavoidably constrained by the particulars of any specific evaluation dataset, our observations support a series of recommendations towards maximizing the number of resulting identifications while controlling database searches with robust and reproducible TDA estimation of empirical FDR.
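The basic target-decoy calculation discussed above can be sketched as follows. This is only one common variant (decoy hits divided by target hits above a score threshold) and is an assumption made for illustration; as the abstract stresses, there is no consensus on the exact procedure, and this sketch does not reproduce any specific variant evaluated in the study. The function names are hypothetical.

```python
def estimate_fdr(psms, threshold):
    """Estimate FDR as decoys/targets among PSMs scoring at or above `threshold`.

    `psms` is a list of (score, is_decoy) pairs from a target-decoy search.
    """
    targets = sum(1 for score, is_decoy in psms
                  if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms
                 if score >= threshold and is_decoy)
    if targets == 0:
        return 0.0
    return decoys / targets


def threshold_for_fdr(psms, max_fdr):
    """Return the lowest observed score whose estimated FDR is within `max_fdr`."""
    best = None
    for t in sorted({score for score, _ in psms}, reverse=True):
        if estimate_fdr(psms, t) <= max_fdr:
            best = t  # keep lowering the cut-off while the estimate stays acceptable
    return best
```

Because the estimate depends on how decoys are generated, searched, and counted, two laboratories applying "1% FDR" with different variants can report lists with quite different true error rates.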
We developed and compared two approaches for automated validation of phosphopeptide tandem mass spectra identified using database searching algorithms. Phosphopeptide identifications were obtained through SEQUEST searches of a protein database appended with its decoy (reversed sequences). Statistical evaluation and iterative searches were employed to create a high quality dataset of phosphopeptides. Automation of post-search validation was approached by two different strategies. In the first method, we use statistical multiple testing to calculate a p-value for each tentative peptide phosphorylation. In the second method, we use a support vector machine (a machine learning algorithm) binary classifier to predict whether a tentative peptide phosphorylation is true or not. We show good agreement (85%) between post-search validation of phosphopeptide/spectrum matches by multiple testing and that from support vector machines. Automatic methods agree very well with manual expert validation in a blinded test. Additionally, the algorithms were tested on the identification of synthetic phosphopeptides. We show that phosphate neutral losses in tandem mass spectra can be used to assess the correctness of phosphopeptide/spectrum matches. An SVM classifier with a radial basis function provided classification accuracy from 95.7% to 96.8% of the positive dataset, depending on the search algorithm used. Establishing the efficacy of an identification is a necessary step for further post-search interrogation of the spectra for complete localization of phosphorylation sites. Our current implementation performs validation of phosphoserine/phosphothreonine containing peptides having 1 or 2 phosphorylation sites from data gathered on an ion trap mass spectrometer. The SVM-based algorithm has been implemented in a software package, DeBunker. We illustrate the application of the SVM-based software DeBunker on a large phosphorylation dataset.
Accurate determination of protein phosphorylation is challenging, particularly for researchers who lack access to a high-accuracy mass spectrometer. In this study, multiple protocols were used to enrich phosphopeptides, and a rigorous filtering workflow was used to analyze the resulting samples. Phosphopeptides were enriched from cultured rat renal proximal tubule cells using three commonly used protocols and a dual method that combines separate immobilized metal affinity chromatography (IMAC) and titanium dioxide (TiO2) chromatography, termed dual IMAC (DIMAC). Phosphopeptides from all four enrichment strategies were analyzed by liquid chromatography-multiple levels of mass spectrometry (LC-MSn) neutral-loss scanning using a linear ion trap mass spectrometer. Initially, the resulting MS2 and MS3 spectra were analyzed using PeptideProphet and database search engine thresholds that produced a false discovery rate (FDR) of <1.5% when searched against a reverse database. However, only 40% of the potential phosphopeptides were confirmed by manual validation. The combined analyses yielded 110 confidently identified phosphopeptides. Using less-stringent initial filtering thresholds (FDR of 7–9%), followed by rigorous manual validation, 262 unique phosphopeptides, including 111 novel phosphorylation sites, were identified confidently. Thus, traditional methods of data filtering within widely accepted FDRs were inadequate for the analysis of low-resolution phosphopeptide spectra. However, the combination of a streamlined front-end enrichment strategy and rigorous manual spectral validation allowed for confident phosphopeptide identifications from a complex sample using a low-resolution ion trap mass spectrometer.
phosphoproteomics; DIMAC enrichment; LC-MSn; kidney proximal tubule; Wistar rat kidney proximal tubule cells
Motivation: Although many methods and statistical approaches have been developed for protein identification by mass spectrometry, the problem of accurate assessment of statistical significance of protein identifications remains an open question. The main issues are as follows: (i) statistical significance of inferring peptide from experimental mass spectra must be platform independent and spectrum specific and (ii) individual spectrum matches at the peptide level must be combined into a single statistical measure at the protein level.
Results: We present a method and software to assign statistical significance to protein identifications from search engines for mass spectrometric data. The approach is based on asymptotic theory of order statistics. The parameters of the asymptotic distributions of identification scores are estimated for each spectrum individually. The method relies on new unbiased estimators for parameters of extreme value distribution. The estimated parameters are used to assign a spectrum-specific P-value to each peptide-spectrum match. The protein-level confidence measure combines P-values of peptide-to-spectrum matches.
Conclusion: We extensively tested the method using triplicate mouse and yeast high-throughput proteomic experiments. The proposed statistical approach improves the sensitivity of protein identifications without compromising specificity. While the method was primarily designed to work with Mascot, it is platform-independent and is applicable to any search engine which outputs a single score for a peptide-spectrum match. We demonstrate this by testing the method in conjunction with X!Tandem.
Availability: The software is available for download at ftp://genetics.bwh.harvard.edu/SSPV/.
Supplementary information: Supplementary data are available at Bioinformatics online.
MassMatrix is a program that matches tandem mass spectra with theoretical peptide sequences derived from a protein database. The program uses a mass accuracy sensitive probabilistic score model to rank peptide matches. The tandem mass spectrometry search software was evaluated by use of a high mass accuracy data set and its results compared with those from Mascot, SEQUEST, X!Tandem, and OMSSA. For the high mass accuracy data, MassMatrix provided better sensitivity than Mascot, SEQUEST, X!Tandem, and OMSSA for a given specificity and the percentage of false positives was 2%. More importantly, all manually validated true positives corresponded to a unique peptide/spectrum match. The presence of decoy sequences and additional variable post-translational modifications did not significantly affect the results from the high mass accuracy search. MassMatrix performs well when compared with Mascot, SEQUEST, X!Tandem, and OMSSA with regard to search time. MassMatrix was also run on a distributed-memory cluster and achieved search speeds of ~100,000 spectra per hour when searching against a complete human database with 8 variable modifications. The algorithm is available for public searches at http://www.massmatrix.net.
Tandem mass spectra; Database search; High mass accuracy; Proteomics; Post-translational modification
Objective: Analyze how precursor and fragment mass tolerance affect the number of true positives and false positives. Introduction: Mass spectrometry coupled to database searching is a powerful and popular protein identification tool. A typical shotgun proteomics experiment begins with degrading intact proteins into peptides. The peptide mixture then undergoes LC-MS/MS analysis, and the resulting experimental spectra are compared to theoretical spectra derived from protein, cDNA, or EST databases. Successful database searching is dependent on database size, post-translational modifications, and precursor and fragment ion m/z tolerance. Method: A standard protein set was made containing 62 verified T. cruzi recombinant proteins spiked into an E. coli lysate. This mixture was digested and then analyzed by LC-MS/MS using an LTQ-Orbitrap. Resulting spectra were searched against forward, reverse, and concatenated databases using Sequest, Mascot, and X!Tandem. Peptide probabilities were calculated using ProteinProphet, and peptide false discovery rates (FDRs) were calculated using ProteoIQ. It is necessary to use a standardized protein mixture to determine the number of true positives (T. cruzi proteins) and false positives (random proteins) found as a function of m/z search tolerance. Preliminary Results: At a 95% probability, more true positives are discovered as ion precursor mass accuracy is increased; however, more false positives are also discovered and at a higher rate. For example, as mass accuracy is increased from +/−1000ppm to +/−20ppm, the number of spectra corresponding to true positives increases by 50% while the number for false positives increases by 380%. Using a 5% FDR filter with the same mass accuracy change yields a 37% increase in true positive matches, while leaving the number of false positives unchanged.
Conclusions: FDR filtering can result in more successful data validation than probability filtering when performing high resolution mass spectrometry.
Tandem mass spectrometry has become particularly useful for the rapid identification and characterization of protein components of complex biological mixtures. Powerful database search methods, such as SEQUEST and MASCOT, have been developed for peptide identification; they compare the mass spectra obtained from unknown proteins or peptides with theoretically predicted spectra derived from protein databases. However, the majority of spectra generated in a mass spectrometry experiment are of too poor quality to be interpreted, while some high-quality spectra cannot be interpreted by one method but may be interpretable by others. Hence a filtering algorithm that removes poor-quality spectra prior to the database search is appealing.
This paper proposes a support vector machine (SVM) based approach to assess the quality of tandem mass spectra. Each mass spectrum is mapped to 16 proposed features that describe its quality. Based on the results from SEQUEST, four SVM classifiers taking the 16 features as input are trained and tested on ISB data and TOV data, respectively. The superior performance of the proposed SVM classifiers is illustrated both by comparison with existing classifiers and by validation against MASCOT search results.
The proposed method can be employed to effectively remove poor-quality spectra before spectral searching, and also to find more peptides or post-translationally modified peptides from high-quality spectra using different search engines or de novo methods.
Tandem mass spectrometry has emerged as a cornerstone of high-throughput proteomic studies, owing in part to various high-throughput search engines that are used to interpret tandem mass spectra. However, the majority of experimental tandem mass spectra cannot be interpreted by any existing method. There are many reasons for this, but one of the most important is that the majority of experimental spectra are of too poor quality to be interpretable. Attempting to interpret these "uninterpretable" spectra by any method wastes time. On the other hand, some high-quality spectra do not receive a score high enough to be interpreted by existing search engines because there are many similar peptides in the searched database. Such spectra may nevertheless be good enough to be interpreted by de novo methods or by manual verification. Therefore, it is worth developing a method for assessing spectral quality, which can be used to filter out poor-quality spectra before any interpretation attempt or to find the most promising candidates for de novo methods or manual verification.
This paper develops a novel method to assess the quality of tandem mass spectra, which can eliminate the majority of poor-quality spectra while losing only a small minority of high-quality spectra. First, a number of features are proposed to describe the quality of tandem mass spectra. The proposed method maps each tandem spectrum into a feature vector. Then Fisher linear discriminant analysis (FLDA) is employed to construct the classifier (the filter) that discriminates the high-quality spectra from the poor-quality ones. The proposed method has been tested on two tandem mass spectra datasets acquired by ion trap mass spectrometers.
Computational experiments illustrate that the proposed method outperforms the existing ones. The proposed method is generic, and is expected to be applicable to assessing the quality of spectra acquired by instruments other than ion trap mass spectrometers.
We present a workflow using an ETD-optimised version of Mascot Percolator and a modified version of SLoMo (turbo-SLoMo) for analysis of phosphoproteomic data. We have benchmarked this against several database searching algorithms and phosphorylation site localisation tools and show that it offers highly sensitive and confident phosphopeptide identification and site assignment with PSM-level statistics, enabling rigorous comparison of data acquisition methods. We analysed the Plasmodium falciparum schizont phosphoproteome using, for the first time, a data-dependent neutral loss-triggered-ETD (DDNL) strategy and a conventional decision-tree method. At a posterior error probability threshold of 0.01, similar numbers of PSMs were identified using both methods with a 73% overlap in phosphopeptide identifications. The false discovery rate associated with spectral pairs where DDNL CID/ETD identified the same phosphopeptide was < 1%. 72% of phosphorylation site assignments using turbo-SLoMo without any score filtering were identical, and 99.8% of these cases are associated with a false localisation rate of < 5%. We show that DDNL acquisition is a useful approach for phosphoproteomics and results in an increased confidence in phosphopeptide identification without compromising sensitivity or duty cycle. Furthermore, the combination of Mascot Percolator and turbo-SLoMo represents a robust workflow for phosphoproteomic data analysis using CID and ETD fragmentation.
Protein phosphorylation is a ubiquitous post-translational modification that regulates protein function. Mass spectrometry-based approaches have revolutionised its analysis on a large scale, but phosphorylation sites are often identified by single phosphopeptides and therefore require more rigorous data analysis to ensure that sites are identified with high confidence for follow-up experiments to investigate their biological significance. The coverage and confidence of phosphoproteomic experiments can be enhanced by the use of multiple complementary fragmentation methods. Here we have benchmarked a data analysis pipeline for analysis of phosphoproteomic data generated using CID and ETD fragmentation and used it to demonstrate the utility of a data-dependent neutral loss triggered ETD fragmentation strategy for high confidence phosphopeptide identification and phosphorylation site localisation.
• We report and benchmark a data analysis pipeline for phosphoproteomic data analysis.
• Combined use of Mascot Percolator and turbo-SLoMo to compare fragmentation methods
• CID and ETD fragmentation for phosphorylation site identification
• Demonstrate the utility of data-dependent neutral loss triggered ETD fragmentation
• High confidence of phosphoproteomic analysis using ETD/CID spectral pairs
CID, collision induced dissociation; ECD, electron capture dissociation; ETD, electron transfer dissociation; ETcaD, ETD with (supplemental activation); FDR, false discovery rate; FLR, false localisation rate; IMAC, immobilised metal affinity chromatography; LC, liquid chromatography; MS, mass spectrometry; MS/MS, tandem mass spectrometry; PEP, posterior error probability; PSM, peptide spectrum match; SCX, strong cation exchange; Phosphoproteomics; Phosphorylation; Mass spectrometry; Post-translational modifications; Electron transfer dissociation; Plasmodium falciparum
In proteomics workflows, proteins are often digested first, then peptides are separated and subjected to identification by mass spectrometry (e.g., 2D-LC). In this process the peptide assignment to a protein is lost and has to be rebuilt by bioinformatic methods. We present ProteinExtractor, a module of the ProteinScape Bioinformatics Platform, which uses an empiric, iterative method to derive minimal protein lists from peptide search results, which may even come from different search algorithms or different MS datasets.
ProteinExtractor uses an iterative approach to generate a minimal protein list. With composite database searches ProteinExtractor allows measuring the false-positive rate of the protein list. A test dataset (five recombinant proteins, 408 spectra, Bruker Ultraflex), and a real-life dataset (200410 LC/ESI-MS/MS spectra, Bruker Esquire HCT-Ultra, and 11619 LC/MALDI-MS/MS spectra, Bruker Ultraflex, both obtained from an analysis of proteins from a human cell line—SW480) were analyzed.
The most probable protein sequence entries contained in the test dataset were identified with intensive manual data interpretation by several mass spectrometry experts. Using standard search algorithms, the correct protein sequence database entries are scattered over the first 171 protein ranks. Together with application specialists, we developed a set of rules to define a minimal protein list containing only those proteins (and isoforms) that can be unequivocally distinguished on the basis of MS/MS data. Applying these rules, the correct five proteins are ranked within the top eight protein candidates.
In the real-life dataset, the peptide search results of Mascot, Sequest, Phenyx, and ProteinSolver were merged using ProteinExtractor. Merging all four search algorithms, over 50% more proteins could be identified than by using Mascot alone (with a false-positive rate of less than 2.5%). Merging ESI and MALDI data together, another 25% more proteins could be identified.
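The idea of iteratively deriving a minimal protein list from peptide identifications can be illustrated with a greedy parsimony heuristic. This is an assumed stand-in, not ProteinExtractor's rule set, which is not specified in this abstract; the function name and input layout are hypothetical.

```python
def minimal_protein_list(protein_to_peptides):
    """Greedy sketch: repeatedly select the protein that explains the most
    peptides not yet explained by previously selected proteins.

    `protein_to_peptides` maps a protein identifier to a set of peptide sequences.
    """
    unexplained = set()
    for peptides in protein_to_peptides.values():
        unexplained |= peptides

    selected = []
    while unexplained:
        # Sorting the identifiers makes tie-breaking deterministic.
        best = max(sorted(protein_to_peptides),
                   key=lambda p: len(protein_to_peptides[p] & unexplained))
        gained = protein_to_peptides[best] & unexplained
        if not gained:
            break
        selected.append(best)
        unexplained -= gained
    return selected
```

A protein whose peptides are all shared with an already selected protein is never added, which mirrors the goal of reporting only proteins that can be distinguished on the basis of the MS/MS evidence.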
A one-step enzymatic reaction for improving the collision-induced dissociation (CID)-based tandem mass spectrometry (MS/MS) analysis of phosphorylated peptides in an ion trap is presented. Carboxypeptidase-B (CBP-B) was used to selectively remove C-terminal arginine or lysine residues from phosphorylated tryptic/Lys-C peptides prior to their MS/MS analysis by CID with a Paul-type ion trap. Removal of this basic C-terminal residue served to limit the extent of gas-phase neutral loss of phosphoric acid (H3PO4), favoring the formation of diagnostic b and y ions as determined by an increase in both the number and relative intensities of the sequence-specific product ions. Such differential fragmentation is particularly valuable when the H3PO4 elimination is so predominant that localizing the phosphorylation site on the peptide sequence is hindered. Improvement in the quality of tandem mass spectral data generated by CID upon CBP-B treatment resulted in greater confidence both in assignment of the phosphopeptide primary sequence and for pinpointing the site of phosphorylation. Higher Mascot ion scores were also generated, combined with lower expectation values and higher delta scores for improved confidence in site assignment; Ascore values also improved. These results are rationalized in accordance with the accepted mechanisms for the elimination of H3PO4 upon low energy CID and insights into the factors dictating the observed dissociation pathways are presented. We anticipate this approach will be of utility in the MS analysis of phosphorylated peptides, especially when alternative electron-driven fragmentation techniques are not available.
Electronic supplementary material
The online version of this article (doi:10.1007/s13361-013-0770-2) contains supplementary material, which is available to authorized users.
Collision-induced dissociation; Phosphorylation; Mobile proton; Dissociation mechanisms; Tandem mass spectrometry; Ion trap mass spectrometry; Carboxypeptidase-B
Protein identification using mass spectrometry is an important tool in many areas of the life sciences, and in proteomics research in particular. Increasing the number of proteins correctly identified depends on the ability to incorporate new knowledge about the mass spectrometry fragmentation process into computational algorithms designed to separate true matches of peptides to unidentified mass spectra from spurious matches. This discrimination is achieved by computing a function of the various features of the potential match between the observed and theoretical spectra to give a numerical approximation of their similarity. It is these underlying "metrics" that determine the ability of a protein identification package to maximise correct identifications while limiting false discovery rates. There is currently no software available specifically for the simple implementation and analysis of arbitrary novel metrics for peptide matching and for the exploration of fragmentation patterns for a given dataset.
We present Harvest: an open source software tool for analysing fragmentation patterns and assessing the power of a new piece of information about the MS/MS fragmentation process to more clearly differentiate between correct and random peptide assignments. We demonstrate this functionality using data metrics derived from the properties of individual datasets in a peptide identification context. Using Harvest, we demonstrate how the development of such metrics may improve correct peptide assignment confidence in the context of a high-throughput proteomics experiment and characterise properties of peptide fragmentation.
Harvest provides a simple framework in C++ for analysing and prototyping metrics for peptide matching, the core of the protein identification problem. It is not a protein identification package and answers a different research question to packages such as Sequest, Mascot, X!Tandem, and other protein identification packages. It does not aim to maximise the number of assigned peptides from a set of unknown spectra, but instead provides a method by which researchers can explore fragmentation properties and assess the power of novel metrics for peptide matching in the context of a given experiment. Metrics developed using Harvest may then become candidates for later integration into protein identification packages.
Peptide identification by tandem mass spectrometry is the dominant proteomics workflow for protein characterization in complex samples. The peptide fragmentation spectra generated by these workflows exhibit characteristic fragmentation patterns that can be used to identify the peptide. In other fields, where the compounds of interest do not have the convenient linear structure of peptides, fragmentation spectra are identified by comparing new spectra with libraries of identified spectra, an approach called spectral matching. In contrast to sequence-based tandem mass spectrometry search engines used for peptides, spectral matching can make use of the intensities of fragment peaks in library spectra to assess the quality of a match. We evaluate a hidden Markov model approach (HMMatch) to spectral matching, in which many examples of a peptide's fragmentation spectrum are summarized in a generative probabilistic model that captures the consensus and variation of each peak's intensity. We demonstrate that HMMatch has good specificity and superior sensitivity, compared to sequence database search engines such as X!Tandem. HMMatch achieves good results from relatively few training spectra, is fast to train, and can evaluate many spectra per second. A statistical significance model permits HMMatch scores to be compared with each other, and with other peptide identification tools, on a unified scale. HMMatch shows a similar degree of concordance with X!Tandem, Mascot, and NIST's MS Search, as they do with each other, suggesting that each tool can assign peptides to spectra that the others miss. Finally, we show that it is possible to extrapolate HMMatch models beyond a single peptide's training spectra to the spectra of related peptides, expanding the application of spectral matching techniques beyond the set of peptides previously observed.
computational molecular biology; mass spectroscopy; HMM; peptide identification; algorithms
Phosphorylation is a protein posttranslational modification. It is responsible for the activation/inactivation of disease-related pathways, thanks to its role as a “molecular switch.” The study of phosphorylated proteins has therefore become a key point for proteomic analyses focused on the identification of diagnostic/therapeutic targets. Liquid chromatography coupled to tandem mass spectrometry (LC-MS/MS) is the most widely used analytical approach. Although unmodified peptides are automatically identified by consolidated algorithms, phosphopeptides still require automated tools to avoid time-consuming manual interpretation. To improve phosphopeptide identification efficiency, a novel procedure was developed and implemented in a Perl/C tool called PhosphoHunter, which is proposed and evaluated here. It includes a preliminary heuristic step for filtering out the MS/MS spectra produced by nonphosphorylated peptides before sequence identification. A method to assess the statistical significance of identified phosphopeptides was also formulated. PhosphoHunter performance was tested on a dataset of 1500 MS/MS spectra and compared with two other tools: Mascot and Inspect. Comparisons demonstrated that a strong point of PhosphoHunter is sensitivity, suggesting that it is able to identify real phosphopeptides with superior performance. Performance indexes depend on a single parameter (intensity threshold) that users can tune according to the study aim. All three tools localized >90% of phosphosites.
Interpreting the potentially vast number of hypotheses generated by a shotgun proteomics experiment requires a valid and accurate procedure for assigning statistical confidence estimates to identified tandem mass spectra. Despite the crucial role such procedures play in most high-throughput proteomics experiments, the scientific literature has not reached a consensus about the best confidence estimation methodology. In this work, we evaluate, using theoretical and empirical analysis, four previously proposed protocols for estimating the false discovery rate (FDR) associated with a set of identified tandem mass spectra: two variants of the target-decoy competition protocol (TDC) of Elias and Gygi and two variants of the separate target-decoy search protocol of Käll et al. Our analysis reveals significant biases in the two separate target-decoy search protocols. Moreover, the one TDC protocol that provides an unbiased FDR estimate among the target PSMs does so at the cost of forfeiting a random subset of high-scoring spectrum identifications. We therefore propose the mix-max procedure to provide unbiased, accurate FDR estimates in the presence of well-calibrated scores. The method avoids biases associated with the two separate target-decoy search protocols and also avoids the propensity for target-decoy competition to discard a random subset of high-scoring target identifications.
mass spectrometry; spectrum identification; false discovery rate
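To make the target-decoy competition idea concrete, the following is a minimal Python sketch of one commonly used estimate: the number of decoy matches above a score threshold divided by the number of target matches above it. It is not the mix-max procedure proposed in the abstract above, and the function name `tdc_fdr` and the `(score, is_decoy)` input layout are assumptions made for illustration.

```python
def tdc_fdr(psms, threshold):
    """Estimate the FDR among target PSMs scoring at or above `threshold`.

    `psms` is a list of (score, is_decoy) pairs, one per spectrum, assumed to
    come from a search against a concatenated target+decoy database, so each
    spectrum keeps only its single best match (target or decoy).
    """
    targets = sum(1 for score, is_decoy in psms
                  if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms
                 if score >= threshold and is_decoy)
    if targets == 0:
        return 0.0  # nothing accepted, so there is nothing to be wrong about
    # Decoy hits stand in for the (unobservable) incorrect target hits.
    return min(1.0, decoys / targets)


# Hypothetical scores, for illustration only.
example = [(10.0, False), (9.0, False), (8.0, True), (7.0, False), (6.0, True)]
fdr_at_7 = tdc_fdr(example, 7.0)  # 1 decoy vs 3 targets at or above 7.0
```

Because each spectrum keeps only its best match, a spectrum whose top hit is a decoy is dropped from the accepted set, which illustrates the forfeiting of identifications that the abstract attributes to target-decoy competition.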
Tandem mass spectrometry (MS/MS) is frequently used in the identification of peptides and proteins. Typical proteomic experiments rely on algorithms such as SEQUEST and MASCOT to compare thousands of tandem mass spectra against the theoretical fragment ion spectra of peptides in a database. The probabilities that these spectrum-to-sequence assignments are correct can be determined by statistical software such as PeptideProphet or through estimations based on reverse or decoy databases. However, many of the software applications that assign probabilities for MS/MS spectra to sequence matches were developed using training datasets from 3D ion-trap mass spectrometers. Given the variety of types of mass spectrometers that have become commercially available over the last five years, we sought to generate a dataset of reference data covering multiple instrumentation platforms to facilitate both the refinement of existing computational approaches and the development of novel software tools. We analyzed the proteolytic peptides in a mixture of tryptic digests of 18 proteins, named the “ISB standard protein mix”, using 8 different mass spectrometers. These include linear and 3D ion traps, two quadrupole time-of-flight platforms (qq-TOF) and two MALDI-TOF-TOF platforms. The resulting dataset, which has been named the Standard Protein Mix Database, consists of over 1.1 million spectra in 150+ replicate runs on the mass spectrometers. The data were inspected for quality of separation and searched using SEQUEST. All data, including the native raw instrument and mzXML formats and the PeptideProphet validated peptide assignments, are available at http://regis-web.systemsbiology.net/PublicDatasets/.
Proteomics; reference dataset; database search software; standard protein mix; Standard Protein Mix Database
Phosphorylation site assignment for large-scale high-throughput tandem mass spectrometry (LC-MS/MS) data is an important aspect of phosphoproteomics. Correct assignment of phosphorylated residue(s) is important for functional interpretation of the data within a biological context. Common search algorithms (Sequest etc.) for mass spectrometry data are not designed for accurate site assignment; thus, additional algorithms are needed. In this paper, we propose a linear-time and linear-space dynamic programming strategy for phosphorylation site assignment. The algorithm, referred to as PhosSA, optimizes an objective function defined as the summation of peak intensities that are associated with theoretical phosphopeptide fragmentation ions. Quality control is achieved through a post-processing criterion whose value is indicative of the signal-to-noise (S/N) properties and redundancy of the fragmentation spectra. The algorithm is tested using experimentally generated data sets of peptides with known phosphorylation sites while varying the fragmentation strategy (CID or HCD) and molar amounts of the peptides. The algorithm is also compatible with various peptide labeling strategies including SILAC and iTRAQ. PhosSA is shown to achieve >99% accuracy with a high degree of sensitivity. The algorithm is extremely fast and scalable (able to process up to 0.5 million peptides in an hour). The implemented algorithm is freely available at http://helixweb.nih.gov/ESBL/PhosSA/ for academic purposes.
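The objective function described above (summed intensity of peaks that match theoretical fragment ions) can be illustrated with a small Python sketch. This is not PhosSA's linear-time dynamic program: it enumerates every candidate site combination by brute force, considers only singly charged b- and y-ions, and ignores neutral losses. The function names, the 0.5 Da tolerance, and the reduced residue table are assumptions made for illustration.

```python
from itertools import combinations

# Monoisotopic residue masses in Da (a partial table; extend as needed).
RESIDUE = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
           "V": 99.06841, "T": 101.04768, "L": 113.08406, "D": 115.02694,
           "K": 128.09496, "E": 129.04259, "R": 156.10111, "Y": 163.06333}
PHOSPHO = 79.96633   # HPO3 added to a phosphorylated S, T or Y
PROTON = 1.00728
WATER = 18.01056


def fragment_mzs(peptide, sites):
    """Singly charged b- and y-ion m/z values for one phosphosite placement."""
    masses = [RESIDUE[aa] + (PHOSPHO if i in sites else 0.0)
              for i, aa in enumerate(peptide)]
    mzs = []
    for cut in range(1, len(peptide)):
        mzs.append(sum(masses[:cut]) + PROTON)           # b ion
        mzs.append(sum(masses[cut:]) + WATER + PROTON)   # y ion
    return mzs


def score(peptide, sites, peaks, tol=0.5):
    """Sum intensities of observed peaks within `tol` Da of a theoretical ion."""
    theoretical = fragment_mzs(peptide, sites)
    return sum(intensity for mz, intensity in peaks
               if any(abs(mz - t) <= tol for t in theoretical))


def best_sites(peptide, n_phospho, peaks, tol=0.5):
    """Return the S/T/Y index tuple with the highest summed matched intensity."""
    candidates = [i for i, aa in enumerate(peptide) if aa in "STY"]
    return max(combinations(candidates, n_phospho),
               key=lambda s: score(peptide, set(s), peaks, tol))
```

Given a peak list of `(m/z, intensity)` pairs, `best_sites("ASGTK", 1, peaks)` returns a tuple of zero-based residue indices. The enumeration grows combinatorially with the number of candidate residues, which is the cost a dynamic programming formulation is meant to avoid.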
Canine babesiosis is a tick-borne disease caused by haemoprotozoan parasites of the genus Babesia. There are limited data on serum proteomics in dogs, and none on the effect of babesiosis on the serum proteome. The aim of this study was to identify potential serum biomarkers of babesiosis using proteomic techniques in order to increase our understanding of disease pathogenesis.
Serum samples were collected from 25 dogs of various breeds and sex with naturally occurring babesiosis caused by B. canis canis. Blood was collected on the day of admission (day 0), and subsequently on the 1st and 6th day of treatment.
Two-dimensional electrophoresis (2DE) of pooled serum samples from dogs with naturally occurring babesiosis (day 0, day 1 and day 6) and from healthy dogs was run in triplicate. 2DE image analysis showed 64 differentially expressed spots with p ≤ 0.05 and 49 spots with fold change ≥2. Six selected spots were excised manually and subjected to trypsin digestion prior to identification by electrospray ionisation tandem mass spectrometry (MS/MS) on an Amazon ion trap mass spectrometer. Mass spectrometry data was processed using Data Analysis software and the automated Matrix Science Mascot Daemon server. Protein identifications were assigned using the Mascot search engine to interrogate protein sequences in the NCBI Genbank database.
A number of differentially expressed serum proteins involved in the inflammation-mediated acute phase response, complement and coagulation cascades, apolipoproteins and the vitamin D metabolism pathway were identified in dogs with babesiosis.
Our findings confirmed two dominant pathogenic mechanisms of babesiosis, haemolysis and acute phase response. These results may provide possible serum biomarker candidates for clinical monitoring of babesiosis and this study could serve as the basis for further proteomic investigations in canine babesiosis.
Dog; Babesiosis; Acute phase proteins; Serum biomarkers; Proteomics; 2-dimensional electrophoresis
Current experimental techniques, especially those applying liquid chromatography mass spectrometry, have made high-throughput proteomic studies possible. The increase in throughput, however, also raises concerns about the accuracy of identification or quantification. In a given MS scan, most experimental procedures select only the few most intense parent ions, each to be fragmented (MS2) separately; most other minor co-eluted peptides with similar chromatographic retention times are ignored and their information is lost.
We have computationally investigated the possibility of enhancing the information retrieval during a given LC/MS experiment by selecting the two or three most intense parent ions for simultaneous fragmentation. A set of spectra was created by superimposing a number of MS2 spectra, each of which can be identified with high confidence by all search methods tested, to mimic the spectra of co-eluted peptides. The generated convoluted spectra were used to evaluate the capability of several database search methods (SEQUEST, Mascot, X!Tandem, OMSSA, and RAId_DbS) to identify true peptides from superimposed spectra of co-eluted peptides. We show that, using these simulated spectra, all the database search methods eventually gain in the number of true peptides identified when using the compound spectra of co-eluted peptides.
Open peer review
Reviewed by Vlad Petyuk (nominated by Arcady Mushegian), King Jordan and Shamil Sunyaev. For the full reviews, please go to the Reviewers' comments section.
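The superposition of MS2 spectra described in the abstract above can be sketched as a merge of peak lists. The abstract does not say how intensities were combined or whether spectra were normalized first, so summing intensities within fixed-width m/z bins is an assumption made here for illustration, as are the function name `superimpose` and the `bin_width` parameter.

```python
def superimpose(spectra, bin_width=0.01):
    """Merge several peak lists into one compound spectrum.

    Each spectrum is a list of (mz, intensity) pairs. Peaks falling in the
    same m/z bin (assumed width `bin_width`, in Da) have their intensities
    summed; the result is sorted by m/z.
    """
    merged = {}
    for peaks in spectra:
        for mz, intensity in peaks:
            key = round(mz / bin_width)  # integer bin index
            merged[key] = merged.get(key, 0.0) + intensity
    return sorted((key * bin_width, total) for key, total in merged.items())
```

A call such as `superimpose([spectrum_a, spectrum_b, spectrum_c])` would produce one peak list that mimics the simultaneous fragmentation of three co-eluted parent ions.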
are biologically significant large molecules that participate in numerous cellular activities. In order to obtain site-specific protein glycosylation information, intact glycopeptides, with the glycan attached to the peptide sequence, are characterized by tandem mass spectrometry (MS/MS) methods such as collision-induced dissociation (CID) and electron transfer dissociation (ETD). While several emerging automated tools are developed, no consensus is present in the field about the best way to determine the reliability of the tools and/or provide the false discovery rate (FDR). A common approach to calculate FDRs for glycopeptide analysis, adopted from the target-decoy strategy in proteomics, employs a decoy database that is created based on the target protein sequence database. Nonetheless, this approach is not optimal in measuring the confidence of N-linked glycopeptide matches, because the glycopeptide data set is considerably smaller compared to that of peptides, and the requirement of a consensus sequence for N-glycosylation further limits the number of possible decoy glycopeptides tested in a database search. To address the need to accurately determine FDRs for automated glycopeptide assignments, we developed GlycoPep Evaluator (GPE), a tool that helps to measure FDRs in identifying glycopeptides without using a decoy database. GPE generates decoy glycopeptides de novo for every target glycopeptide, in a 1:20 target-to-decoy ratio. The decoys, along with target glycopeptides, are scored against the ETD data, from which FDRs can be calculated accurately based on the number of decoy matches and the ratio of the number of targets to decoys, for small data sets. GPE is freely accessible for download and can work with any search engine that interprets ETD data of N-linked glycopeptides. The software is provided
Human glycoproteins exhibit enormous heterogeneity at each N-glycosite, but few studies have attempted to globally characterize the site-specific structural features. We have developed the Integrated GlycoProteome Analyzer (I-GPA), including a mapping system for complex N-glycoproteomes, which combines methods for tandem mass spectrometry with a database search and algorithmic suite. Using an N-glycopeptide database that we constructed, we created novel scoring algorithms with decoy glycopeptides, with which 95 N-glycopeptides from standard α1-acid glycoprotein were identified with 0% false positives, giving the same results as manual validation. Additionally, an automated label-free quantitation method was developed that utilizes the combined intensity of the top three isotope peaks at the three highest MS spectral points. The efficiency of I-GPA was demonstrated by automatically identifying 619 site-specific N-glycopeptides with FDR ≤ 1%, and simultaneously quantifying 598 N-glycopeptides, from human plasma samples that are known to contain highly glycosylated proteins. Thus, the I-GPA platform could make a major breakthrough in high-throughput mapping of complex N-glycoproteomes, which can be applied to biomarker discovery and the ongoing global human proteome project.
The unambiguous assignment of tandem mass spectra (MS/MS) to peptide sequences remains a key unsolved problem in proteomics. Spectral library search strategies have emerged as a promising alternative for peptide identification, in which MS/MS spectra are directly compared against a reference library of confidently assigned spectra. Two problems relate to library size. First, reference spectral libraries are limited to rediscovery of previously identified peptides and are not applicable to new peptides, because of their incomplete coverage of the human proteome. Second, problems arise when searching a spectral library the size of the entire human proteome. We observed that traditional dot product scoring methods do not scale well with spectral library size, showing a reduction in sensitivity as library size increases. We show that this problem can be addressed by optimizing scoring metrics for spectrum-to-spectrum searches with large spectral libraries. MS/MS spectra for the 1.3 million predicted tryptic peptides in the human proteome are simulated using a kinetic fragmentation model (MassAnalyzer version 2.1) to create a proteome-wide simulated spectral library. Searches of the simulated library increase MS/MS assignments by 24% compared with Mascot, when using probabilistic and rank-based scoring methods. The proteome-wide coverage of the simulated library leads to an 11% increase in unique peptide assignments, compared with parallel searches of a reference spectral library. Further improvement is attained when reference spectra and simulated spectra are combined into a hybrid spectral library, yielding 52% more MS/MS assignments than Mascot searches. Our study demonstrates the advantages of using probabilistic and rank-based scores to improve the performance of spectrum-to-spectrum search strategies.
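For readers unfamiliar with the traditional dot product score mentioned above, the following Python sketch computes a normalized dot product (cosine similarity) between two binned spectra. It shows only the baseline score, not the probabilistic or rank-based scores the abstract advocates; the 1 Da bin width and the function names are assumptions made for illustration.

```python
import math


def binned(peaks, bin_width=1.0):
    """Collapse (mz, intensity) pairs into a sparse vector keyed by bin index."""
    vector = {}
    for mz, intensity in peaks:
        key = int(mz // bin_width)
        vector[key] = vector.get(key, 0.0) + intensity
    return vector


def normalized_dot(query, reference, bin_width=1.0):
    """Cosine similarity between two spectra after binning; ranges from 0 to 1."""
    q = binned(query, bin_width)
    r = binned(reference, bin_width)
    dot = sum(value * r.get(key, 0.0) for key, value in q.items())
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in r.values())))
    return dot / norm if norm else 0.0
```

A library search then amounts to scoring the query against every library entry and keeping the highest-scoring one, for example `max(library, key=lambda ref: normalized_dot(query, ref))` for a hypothetical `library` of peak lists. As the library grows, more unrelated entries can reach a high score by chance, which is consistent with the loss of sensitivity the abstract reports.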