Tandem mass spectrometry (MS/MS) is a widely used method for proteome-wide analysis of protein expression and post-translational modifications (PTMs). The thousands of MS/MS spectra produced from a single experiment pose a major challenge for downstream analysis. Standard programs, such as Mascot, provide peptide assignments for many of the spectra, including identification of PTM sites, but these results are plagued by false positive identifications. In phosphoproteomics experiments only a single peptide assignment is typically available to support identification of each phosphorylation site, so minimizing false positives is critical. Thus, tedious manual validation is often required to increase confidence in the spectral assignments.
We have developed phoMSVal, an open-source platform for managing MS/MS data and automatically validating identified phosphopeptides. We tested five classification algorithms with 17 extracted features to separate correct peptide assignments from incorrect ones using over 3000 manually curated spectra. The naive Bayes algorithm was among the best classifiers with an area under the ROC curve value of 97% and positive predictive value of 97% for phosphotyrosine data. This classifier required only three features to achieve a 76% decrease in false positives as compared to Mascot while retaining 97% of true positives. This algorithm was able to classify an independent phosphoserine/threonine dataset with area under ROC curve value of 93% and positive predictive value of 91%, demonstrating the applicability of this method for all types of phospho-MS/MS data. PhoMSVal is available at http://csbi.ltdk.helsinki.fi/phomsval
bioinformatics; data management; feature selection; machine learning; phosphoproteomics
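The performance figures quoted in the abstract above (positive predictive value, fraction of true positives retained) all follow from a confusion matrix of classifier calls against manually curated labels. The following is a minimal sketch of that arithmetic only. The function names and the toy labels are illustrative assumptions and are not taken from phoMSVal or its data.

```python
def confusion_counts(labels, predictions):
    """Count TP, FP, TN, FN; True means 'correct peptide assignment'."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(labels, predictions):
        if pred and truth:
            tp += 1
        elif pred and not truth:
            fp += 1
        elif not pred and truth:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def positive_predictive_value(tp, fp):
    """Fraction of accepted assignments that are actually correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def sensitivity(tp, fn):
    """Fraction of correct assignments that the classifier retains."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical toy data: curated truth vs. classifier output.
labels = [True, True, True, False, False, True, False, True]
predictions = [True, True, False, False, True, True, False, True]

tp, fp, tn, fn = confusion_counts(labels, predictions)
ppv = positive_predictive_value(tp, fp)
sens = sensitivity(tp, fn)
```

A reduction in false positives relative to a baseline can be computed the same way, by comparing the `fp` count of the classifier with the `fp` count obtained when every search-engine assignment is accepted.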
Phosphorylation site assignment of high throughput tandem mass spectrometry (LC-MS/MS) data is one of the most common and critical aspects of phosphoproteomics. Correctly assigning phosphorylated residues helps us understand their biological significance. The design of common search algorithms (such as Sequest, Mascot, etc.) does not incorporate site assignment; therefore additional algorithms are essential to assign phosphorylation sites for mass spectrometry data. The main contribution of this study is the design and implementation of a linear time and space dynamic programming strategy for phosphorylation site assignment referred to as PhosSA. The proposed algorithm uses summation of peak intensities associated with theoretical spectra as an objective function. Quality control of the assigned sites is achieved using a post-processing redundancy criterion that indicates the signal-to-noise ratio properties of the fragmented spectra. The quality of the algorithm was assessed using experimentally generated data sets of synthetic peptides for which phosphorylation sites were known. We report that PhosSA was able to achieve a high degree of accuracy and sensitivity with all the experimentally generated mass spectrometry data sets. The implemented algorithm is shown to be extremely fast and scalable with an increasing number of spectra (we report up to 0.5 million spectra/hour on a moderate workstation). The algorithm is designed to accept results from both Sequest and Mascot search engines. An executable is freely available at http://helixweb.nih.gov/ESBL/PhosSA/ for academic research purposes.
Phosphorylation site assignment for large-scale data from high-throughput tandem mass spectrometry (LC-MS/MS) is an important aspect of phosphoproteomics. Correct assignment of phosphorylated residue(s) is important for functional interpretation of the data within a biological context. Common search algorithms (Sequest, etc.) for mass spectrometry data are not designed for accurate site assignment; thus, additional algorithms are needed. In this paper, we propose a linear-time and linear-space dynamic programming strategy for phosphorylation site assignment. The algorithm, referred to as PhosSA, optimizes the objective function defined as the summation of peak intensities that are associated with theoretical phosphopeptide fragmentation ions. Quality control is achieved through the use of a post-processing criterion whose value is indicative of the signal-to-noise (S/N) properties and redundancy of the fragmentation spectra. The algorithm is tested using experimentally generated data sets of peptides with known phosphorylation sites while varying the fragmentation strategy (CID or HCD) and molar amounts of the peptides. The algorithm is also compatible with various peptide labeling strategies including SILAC and iTRAQ. PhosSA is shown to achieve > 99% accuracy with a high degree of sensitivity. The algorithm is extremely fast and scalable (able to process up to 0.5 million peptides in an hour). The implemented algorithm is freely available at http://helixweb.nih.gov/ESBL/PhosSA/ for academic purposes.
Tandem mass spectrometry has become particularly useful for the rapid identification and characterization of protein components of complex biological mixtures. Powerful database search methods have been developed for peptide identification, such as SEQUEST and MASCOT, which are implemented by comparing the mass spectra obtained from unknown proteins or peptides with theoretically predicted spectra derived from protein databases. However, the majority of spectra generated from a mass spectrometry experiment are of too poor quality to be interpreted, while some high-quality spectra cannot be interpreted by one method but can perhaps be interpreted by others. Hence a filtering algorithm that removes those spectra with poor quality prior to the database search is appealing.
This paper proposes a support vector machine (SVM) based approach to assess the quality of tandem mass spectra. Each mass spectrum is mapped onto 16 proposed features that describe its quality. Based on the results from SEQUEST, four SVM classifiers that take the 16 features as input are trained and tested on ISB data and TOV data, respectively. The superior performance of the proposed SVM classifiers is illustrated both by comparison with existing classifiers and by validation against MASCOT search results.
The proposed method can be employed to effectively remove poor quality spectra before spectral searching, and also to find more peptides or post-translationally modified peptides from high quality spectra using different search engines or de novo methods.
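The objective described above, a sum of peak intensities matched to theoretical fragment ions, lends itself to dynamic programming because the mass of each b/y fragment depends only on the cleavage position and on how many phosphates lie to its left. The sketch below illustrates that idea under strong simplifying assumptions: singly charged b and y ions only, no neutral losses, a fixed m/z tolerance, and a naive peak lookup. It is not the PhosSA implementation, and every name in it (`assign_sites`, `matched_intensity`, the constants) is hypothetical.

```python
PHOSPHO = 79.96633  # monoisotopic mass shift of HPO3
WATER = 18.01056
PROTON = 1.00728
RESIDUE_MASS = {  # monoisotopic residue masses
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def matched_intensity(spectrum, mz, tol):
    """Sum the intensities of all peaks within tol of a theoretical m/z."""
    return sum(inten for peak_mz, inten in spectrum if abs(peak_mz - mz) <= tol)


def assign_sites(peptide, n_phospho, spectrum, tol=0.5):
    """Return (1-based site positions, score) maximizing matched intensity."""
    n = len(peptide)
    prefix = [0.0]
    for aa in peptide:
        prefix.append(prefix[-1] + RESIDUE_MASS[aa])
    total = prefix[n]

    def cleavage_score(i, j):
        # b_i carries j phosphates; the complementary y ion carries the rest.
        b_mz = prefix[i] + j * PHOSPHO + PROTON
        y_mz = (total - prefix[i]) + (n_phospho - j) * PHOSPHO + WATER + PROTON
        return matched_intensity(spectrum, b_mz, tol) + matched_intensity(spectrum, y_mz, tol)

    neg = float("-inf")
    # best[i][j]: best score over cleavages 1..i with j phosphates in residues 1..i
    best = [[neg] * (n_phospho + 1) for _ in range(n + 1)]
    took = [[False] * (n_phospho + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(1, n + 1):
        can_carry = peptide[i - 1] in "STY"
        for j in range(n_phospho + 1):
            candidates = [(best[i - 1][j], False)]
            if can_carry and j > 0:
                candidates.append((best[i - 1][j - 1], True))
            prev, used = max(candidates)
            if prev == neg:
                continue  # state unreachable
            gain = cleavage_score(i, j) if i < n else 0.0
            best[i][j] = prev + gain
            took[i][j] = used
    if best[n][n_phospho] == neg:
        raise ValueError("not enough S/T/Y residues for the requested sites")

    # Trace back which residues were chosen.
    sites, j = [], n_phospho
    for i in range(n, 0, -1):
        if took[i][j]:
            sites.append(i)
            j -= 1
    return sorted(sites), best[n][n_phospho]


# Toy spectrum: b and y ions of ASATK with the phosphate on T (position 4).
spectrum = [(72.044, 10.0), (159.076, 10.0), (230.114, 10.0), (411.128, 10.0),
            (147.113, 10.0), (328.127, 10.0), (399.164, 10.0), (486.196, 10.0)]
sites, score = assign_sites("ASATK", 1, spectrum)  # selects position 4
```

The table has one state per (residue position, phosphate count) pair, so for a fixed number of phosphates the number of states grows linearly with peptide length. The naive `matched_intensity` scan over all peaks is kept for brevity and would need an indexed lookup to achieve the running time described in the abstract.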
Correct phosphorylation site assignment is a critical aspect of phosphoproteomic analysis. Large-scale phosphopeptide data sets that are generated through liquid chromatography-coupled tandem mass spectrometry (LC-MS/MS) analysis often contain hundreds or thousands of phosphorylation sites that require validation. To this end, we have created PhosphoScore, an open-source assignment program that is compatible with phosphopeptide data from multiple MS levels (MSn). The algorithm takes into account both the match quality and normalized intensity of observed spectral peaks compared to a theoretical spectrum. PhosphoScore produced >95% correct MS2 assignments from known synthetic data, >98% agreement with an established MS2 assignment algorithm (Ascore), and >92% agreement with visual inspection of MS3 and MS4 spectra.
MassMatrix is a program that matches tandem mass spectra with theoretical peptide sequences derived from a protein database. The program uses a mass accuracy sensitive probabilistic score model to rank peptide matches. The tandem mass spectrometry search software was evaluated by use of a high mass accuracy data set and its results compared with those from Mascot, SEQUEST, X!Tandem, and OMSSA. For the high mass accuracy data, MassMatrix provided better sensitivity than Mascot, SEQUEST, X!Tandem, and OMSSA for a given specificity and the percentage of false positives was 2%. More importantly all manually validated true positives corresponded to a unique peptide/spectrum match. The presence of decoy sequence and additional variable post-translational modifications did not significantly affect the results from the high mass accuracy search. MassMatrix performs well when compared with Mascot, SEQUEST, X!Tandem, and OMSSA with regard to search time. MassMatrix was also run on a distributed memory cluster and achieved search speeds of ~100,000 spectra per hour when searching against a complete human database with 8 variable modifications. The algorithm is available for public searches at http://www.massmatrix.net.
Tandem mass spectra; Database search; High mass accuracy; Proteomics; Post-translational modification
It is a major challenge to develop effective sequence database search algorithms to translate molecular weight and fragment mass information obtained from tandem mass spectrometry into high quality peptide and protein assignments. We investigated the peptide identification performance of Mascot and X!Tandem for mass tolerance settings common for low and high accuracy mass spectrometry. We demonstrated that sensitivity and specificity of peptide identification can vary substantially for different mass tolerance settings, but this effect was more significant for Mascot. We present an adjusted Mascot threshold, which allows the user to freely select the best trade-off between sensitivity and specificity. The adjusted Mascot threshold was compared with the default Mascot and X!Tandem scoring thresholds and shown to be more sensitive at the same false discovery rates for both low and high accuracy mass spectrometry data.
The promise of mass spectrometry as a tool for probing signal-transduction is predicated on reliable identification of post-translational modifications. Phosphorylations are key mediators of cellular signaling, yet are hard to detect, partly because of unusual fragmentation patterns of phosphopeptides. In addition to being accurate, MS/MS identification software must be robust and efficient to deal with increasingly large spectral data sets. Here, we present a new scoring function for the Inspect software for phosphorylated peptide tandem mass spectra for ion-trap instruments, without the need for manual validation. The scoring function was modeled by learning fragmentation patterns from 7677 validated phosphopeptide spectra. We compare our algorithm against SEQUEST and X!Tandem on testing and training data sets. At a 1% false positive rate, Inspect identified the greatest total number of phosphorylated spectra, 13% more than SEQUEST and 39% more than X!Tandem. Spectra identified by Inspect tended to score better in several spectral quality measures. Furthermore, Inspect runs much faster than either SEQUEST or X!Tandem, making desktop phosphoproteomics feasible. Finally, we used our new models to reanalyze a corpus of 423 000 LTQ spectra acquired for a phosphoproteome analysis of Saccharomyces cerevisiae DNA damage and repair pathways and discovered 43% more phosphopeptides than the previous study.
Phosphoproteomics; Scoring; High-throughput proteomics; Post-translational modifications
Database-search programs for peptide identification by tandem mass spectrometry ask their users to set various parameters, including precursor and fragment mass tolerances, digestion specificity, and allowed types of modifications. Even proteomics experts with detailed knowledge of their samples may find it difficult to make these choices without significant investigation, and poor choices can lead to missed identifications and misleading results. Here we describe a program called Preview that analyzes a set of mass spectra for mass errors, digestion specificity, and known and unknown modifications, thereby facilitating parameter selection. Moreover, Preview optionally recalibrates mass-over-charge measurements, leading to further improvement in identification results. In a study of Bruton’s tyrosine kinase, we find that the use of Preview improved the number of confidently identified mass spectra and phosphorylation sites by about 50%.
An important step in mass spectrometry (MS)-based proteomics is the identification of peptides by their fragment spectra. Regardless of the identification score achieved, almost all tandem-MS (MS/MS) spectra contain remaining peaks that are not assigned by the search engine. These peaks may be explainable by human experts but the scale of modern proteomics experiments makes this impractical. In computer science, Expert Systems are a mature technology to implement a list of rules generated by interviews with practitioners. We here develop such an Expert System, making use of literature knowledge as well as a large body of high mass accuracy and pure fragmentation spectra. Interestingly, we find that even with high mass accuracy data, rule sets can quickly become too complex, leading to over-annotation. Therefore we establish a rigorous false discovery rate, calculated by random insertion of peaks from a large collection of other MS/MS spectra, and use it to develop an optimized knowledge base. This rule set correctly annotates almost all peaks of medium or high abundance. For high resolution HCD data, median intensity coverage of fragment peaks in MS/MS spectra increases from 58% by search engine annotation alone to 86%. The resulting annotation performance surpasses a human expert, especially on complex spectra such as those of larger phosphorylated peptides. Our system is also applicable to high resolution collision-induced dissociation data. It is available both as a part of MaxQuant and via a webserver that only requires an MS/MS spectrum and the corresponding peptide sequence, and which outputs publication quality, annotated MS/MS spectra (www.biochem.mpg.de/mann/tools/). It provides expert knowledge to beginners in the field of MS-based proteomics and helps advanced users to focus on unusual and possibly novel types of fragment ions.
Tandem mass spectrometry (MS/MS) is frequently used in the identification of peptides and proteins. Typical proteomic experiments rely on algorithms such as SEQUEST and MASCOT to compare thousands of tandem mass spectra against the theoretical fragment ion spectra of peptides in a database. The probabilities that these spectrum-to-sequence assignments are correct can be determined by statistical software such as PeptideProphet or through estimations based on reverse or decoy databases. However, many of the software applications that assign probabilities for MS/MS spectra to sequence matches were developed using training datasets from 3D ion-trap mass spectrometers. Given the variety of types of mass spectrometers that have become commercially available over the last five years, we sought to generate a dataset of reference data covering multiple instrumentation platforms to facilitate both the refinement of existing computational approaches and the development of novel software tools. We analyzed the proteolytic peptides in a mixture of tryptic digests of 18 proteins, named the “ISB standard protein mix”, using 8 different mass spectrometers. These include linear and 3D ion traps, two quadrupole time-of-flight platforms (qq-TOF) and two MALDI-TOF-TOF platforms. The resulting dataset, which has been named the Standard Protein Mix Database, consists of over 1.1 million spectra in 150+ replicate runs on the mass spectrometers. The data were inspected for quality of separation and searched using SEQUEST. All data, including the native raw instrument and mzXML formats and the PeptideProphet validated peptide assignments, are available at http://regis-web.systemsbiology.net/PublicDatasets/.
Proteomics; reference dataset; database search software; standard protein mix; Standard Protein Mix Database
Large-scale phosphoproteomic analysis employing liquid chromatography-tandem mass spectrometry (LC–MS/MS) often requires a significant amount of manual manipulation of phosphopeptide datasets in the postacquisition phase. To assist in this process, we have created software, PhosphoPIC (PhosphoPeptide Identification and Compilation), which can perform a variety of useful functions including automated selection and compilation of phosphopeptide identifications from multiple MS levels, estimation of dataset false discovery rate, and application of appropriate cross-correlation (XCorr) filters. In addition, the output files generated by this program are compatible with downstream phosphorylation site assignment using the Ascore algorithm, as well as phosphopeptide quantification via QUOIL. In this report, we utilized this software to analyze phosphoproteins from short-term vasopressin-treated rat kidney inner medullary collecting duct (IMCD). A total of 925 phosphopeptides representing 173 unique proteins were identified from membrane-enriched fractions of IMCD with a false discovery rate of 1.5%. Of these proteins, 106 were found only in the membrane-enriched fraction of IMCD cells and not in whole IMCD cell lysates. These identifications included a number of well-studied ion and solute transporters including ClC-1, LAT4, MCT2, NBC3, and NHE1, all of which contained novel phosphorylation sites. Using a label-free quantification approach, we identified phosphoproteins that changed in abundance with vasopressin exposure including aquaporin-2 (AQP2), Hnrpa3, IP3 receptor 3, and pur-beta.
phosphoproteomics; neutral loss; target decoy; LC-MS/MS; collecting duct; IMCD; mass spectrometry; label free; PhosphoPIC; proteomics
Tandem mass spectrometry has become a remarkably powerful technology to identify proteins in proteomics. Bioinformatics tools, especially database searching tools, are essential for the interpretation of large quantities of proteomics data. Despite recent improvements in database searching algorithms, only a relatively small fraction of spectra can be confidently assigned to peptide sequences in a typical proteomics analysis. The remaining unassigned spectra often consist of low quality spectra that cause a significant amount of computational overhead but that contribute little to protein identification. On the other hand, many high quality spectra remain unassigned due to modifications, mutations, and the deficiencies of the scoring methods implemented in database searching tools. Here we present ScanRanker, an open-source algorithm that offers a robust method for spectral quality assessment. Unlike existing tools that require training software for each type of instrument to be employed, ScanRanker evaluates quality of tandem mass spectra via sequence tagging, providing reliable performance in data sets from different instruments. The superior performance of ScanRanker enables it not only to filter low quality spectra prior to database searching, but also to find unassigned high quality spectra that evade identification through database search.
Tandem mass spectrometry has emerged as a cornerstone of high throughput proteomic studies, owing in part to various high throughput search engines that are used to interpret tandem mass spectra. However, the majority of experimental tandem mass spectra cannot be interpreted by any existing method. There are many reasons for this, but one of the most important is that the majority of experimental spectra are of too poor quality to be interpretable. Attempting to interpret these "uninterpretable" spectra by any method wastes time. On the other hand, some high quality spectra fail to obtain a score high enough to be interpreted by existing search engines because there are many similar peptides in the searched database. Such spectra may nevertheless be good enough to be interpreted by de novo methods or by manual verification. Therefore, it is worth developing a method for assessing spectral quality, which can be used to filter out poor quality spectra before any interpretation attempt or to find the most promising candidates for de novo methods or manual verification.
This paper develops a novel method to assess the quality of tandem mass spectra, which can eliminate the majority of poor quality spectra while losing only a small minority of high quality spectra. First, a number of features are proposed to describe the quality of tandem mass spectra. The proposed method maps each tandem spectrum into a feature vector. Then Fisher linear discriminant analysis (FLDA) is employed to construct the classifier (the filter) which discriminates the high quality spectra from the poor quality ones. The proposed method has been tested on two tandem mass spectra datasets acquired by ion trap mass spectrometers.
Computational experiments illustrate that the proposed method outperforms the existing ones. The proposed method is generic, and is expected to be applicable to assessing the quality of spectra acquired by instruments other than ion trap mass spectrometers.
High-throughput mass spectroscopy data combined with a six-frame translation of the human genome can be used to identify novel protein encoding genes, as demonstrated with a search for plasma proteins.
Defining the location of genes and the precise nature of gene products remains a fundamental challenge in genome annotation. Interrogating tandem mass spectrometry data using genomic sequence provides an unbiased method to identify novel translation products. A six-frame translation of the entire human genome was used as the query database to search for novel blood proteins in the data from the Human Proteome Organization Plasma Proteome Project. Because this target database is orders of magnitude larger than the databases traditionally employed in tandem mass spectra analysis, careful attention to significance testing is required. Confidence of identification is assessed using our previously described Poisson statistic, which estimates the significance of multi-peptide identifications incorporating the length of the matching sequence, number of spectra searched and size of the target sequence database.
Applying a false discovery rate threshold of 0.05, we identified 282 significant open reading frames, each containing two or more peptide matches. There were 627 novel peptides associated with these open reading frames that mapped to a unique genomic coordinate placed within the start/stop points of previously annotated genes. These peptides matched 1,110 distinct tandem MS spectra. Peptides fell into four categories based upon where their genomic coordinates placed them relative to annotated exons within the parent gene.
This work provides evidence for novel alternative splice variants in many previously annotated genes. These findings suggest that annotation of the genome is not yet complete and that proteomics has the potential to further add to our understanding of gene structures.
Proteins can be separated first by pI during isoelectric focusing followed by molecular weight separation on a polyacrylamide gel. After in-gel tryptic digestion, the peptide products are introduced into the mass spectrometer for LC-MS analysis. The spectra containing the peptide fingerprints can then be searched using Mascot. One advantage to using peptide mass fingerprinting for protein identification is that the molecular weight and pI information can be incorporated into Mascot searches to increase the confidence of the results. HRT mass spectrometry analysis of protein standards digested with trypsin introduced using static nanospray shows that protein mixtures can be reliably identified using the Mascot search algorithm provided that no more than two proteins are present in the mixture. HRT spectra from these empirical experiments showed good resolution (50,000) and high mass accuracy (<1 ppm). It should also be noted that the sensitivity can be increased (> 3-fold) using velocity modulation and the resolution increased to (<70,000) using zoom mode while still preserving mass accuracy. In silico analysis of 500 proteins from the S. cerevisiae Swiss-Prot database reveals that the likelihood of having a protein mixture of two or more proteins is 5.98% if the pI resolution equals 1 pH unit and MW resolution equals 500 Da; however, a more optimal separation of the proteins where the pI resolution equals 0.05 pH unit and MW resolution equals 100 Da shows that protein mixtures of two or more proteins occur at a frequency of less than 1%. By achieving optimal protein separation, digested proteins can be directly infused into the mass spectrometer using static nanospray. Without further LC-based separation, proteins can be identified more expediently without sacrificing the ability to reliably identify them using peptide mass fingerprinting.
High-throughput spectrometers are capable of producing data sets containing thousands of spectra for a single biological sample. These data sets contain a substantial amount of redundancy from peptides that may get selected multiple times in an LC-MS/MS experiment. In this paper, we present an efficient algorithm, CAMS (Clustering Algorithm for Mass Spectra), for clustering mass spectrometry data which increases both the sensitivity and confidence of spectral assignment. CAMS utilizes a novel metric, called F-set, that allows accurate identification of the spectra that are similar. A graph theoretic framework is defined that allows the efficient use of the F-set metric for accurate cluster identification. The accuracy of the algorithm is tested on real HCD and CID data sets with varying amounts of peptides. Our experiments show that the proposed algorithm is able to cluster spectra with very high accuracy in a reasonable amount of time for large spectral data sets. Thus, the algorithm is able to decrease the computational time by compressing the data sets while increasing the throughput of the data by interpreting low S/N spectra.
Clustering; Mass spectrometry; Graph Theory; Efficient Algorithms
The development of liquid chromatography coupled with tandem mass spectrometry (LC-MS/MS) has made it possible to measure phosphopeptides in an increasingly large-scale and high-throughput fashion. However, extracting confident phosphopeptide identifications from the resulting large dataset in a similar high-throughput fashion remains difficult, as does rigorously estimating the false discovery rate (FDR) of a set of phosphopeptide identifications. This article describes a data analysis pipeline designed to address these issues. The first step is to re-analyze phosphopeptide identifications that contain ambiguous assignments for the incorporated phosphate(s) to determine the most likely arrangement of the phosphate(s). The next step is to employ an expectation maximization algorithm to estimate the joint distribution of the SEQUEST scores. A linear discriminant analysis is then performed to determine how to optimally combine peptide scores (in this case, SEQUEST) into a discriminant score that possesses the maximum discriminating power. Based on this discriminant score, the p- and q-values for each phosphopeptide identification are calculated, and the phosphopeptide identification FDR is then estimated. This data analysis approach was applied to data from a study of irradiated human skin fibroblasts to provide a robust estimate of FDR for phosphopeptides, and has been coded into a software package that is freely available (http://ncrr.pnl.gov/downloads/data/Du2008_Supplementary_Data.zip).
False Discovery Rate; phosphoproteomics; expectation maximization; linear discriminant analysis; p-value; q-value; Bayesian analysis
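The last step of the pipeline above converts per-identification p-values into q-values. The abstract does not say which procedure is used, so the sketch below shows one common choice, the Benjamini-Hochberg step-up adjustment, purely as an illustration. The function name `q_values` and the example p-values are assumptions.

```python
def q_values(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order.

    The q-value of an identification is the smallest estimated FDR at which
    that identification would still be accepted.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each q-value is monotone in rank.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q[idx] = running_min
    return q


# Hypothetical p-values for four phosphopeptide identifications.
example_q = q_values([0.01, 0.04, 0.03, 0.20])
```

Accepting every identification with a q-value at or below a chosen cutoff then gives a set whose estimated FDR does not exceed that cutoff.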
High-throughput shotgun proteomics data contain a significant number of spectra from non-peptide ions or spectra of too poor quality to obtain highly confident peptide identifications. These spectra cannot be identified with any positive peptide matches in some database search programs or are identified with false positives in others. Removing these spectra can improve the database search results and lower computational expense.
A new algorithm has been developed to filter tandem mass spectra of poor quality from shotgun proteomic experiments. The algorithm determines the noise level dynamically and independently for each spectrum in a tandem mass spectrometric data set. Spectra are filtered based on a minimum number of required signal peaks with a signal-to-noise ratio of 2. The algorithm was tested with 23 sample data sets containing 62,117 total spectra.
The spectral screening removed 89.0% of the tandem mass spectra that did not yield a peptide match when searched with the MassMatrix database search software. Only 6.0% of tandem mass spectra that yielded peptide matches considered to be true positive matches were lost after spectral screening. The algorithm was found to be very effective at removal of unidentified spectra in other database search programs including Mascot, OMSSA, and X!Tandem (75.93%-91.00%) with a small loss (3.59%-9.40%) of true positive matches.
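The screening rule described above has two parts: a noise level estimated separately for each spectrum, and a minimum count of peaks whose signal-to-noise ratio reaches 2. The sketch below shows the shape of such a filter. The abstract does not specify how the noise level is computed or how many signal peaks are required, so the median-intensity noise estimate and the `min_signal_peaks` default are placeholder assumptions, and the function names are hypothetical.

```python
from statistics import median


def count_signal_peaks(intensities, snr_threshold=2.0):
    """Count peaks whose intensity is at least snr_threshold times the noise.

    Assumption: the noise level is the median peak intensity of this spectrum.
    """
    if not intensities:
        return 0
    noise = median(intensities)
    if noise <= 0:
        # Degenerate spectrum: treat every non-zero peak as signal.
        return sum(1 for x in intensities if x > 0)
    return sum(1 for x in intensities if x / noise >= snr_threshold)


def passes_quality_filter(intensities, min_signal_peaks=5, snr_threshold=2.0):
    """Keep a spectrum only if it has enough peaks above the noise."""
    return count_signal_peaks(intensities, snr_threshold) >= min_signal_peaks


# Hypothetical intensity lists for two spectra.
rich_spectrum = [1, 1, 1, 1, 1, 1, 1, 10, 12, 15, 20, 8]
noisy_spectrum = [1, 1, 2, 1, 1, 1, 3, 1]

keep_rich = passes_quality_filter(rich_spectrum)
keep_noisy = passes_quality_filter(noisy_spectrum)
```

Because the noise level is recomputed for each spectrum, the same thresholds can be applied to spectra with very different absolute intensities.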
We developed and compared two approaches for automated validation of phosphopeptide tandem mass spectra identified using database searching algorithms. Phosphopeptide identifications were obtained through SEQUEST searches of a protein database appended with its decoy (reversed sequences). Statistical evaluation and iterative searches were employed to create a high quality dataset of phosphopeptides. Automation of post-search validation was approached by two different strategies. By using statistical multiple testing, we calculate a p-value for each tentative peptide phosphorylation. In a second method, we use a support vector machine (a machine learning algorithm) binary classifier to predict whether a tentative peptide phosphorylation is true or not. We show good agreement (85%) between post-search validation of phosphopeptide/spectrum matches by multiple testing and that from support vector machines. The automatic methods agree very well with manual expert validation in a blinded test. Additionally, the algorithms were tested on the identification of synthetic phosphopeptides. We show that phosphate neutral losses in tandem mass spectra can be used to assess the correctness of phosphopeptide/spectrum matches. An SVM classifier with a radial basis function provided classification accuracy from 95.7% to 96.8% of the positive dataset, depending on search algorithm used. Establishing the efficacy of an identification is a necessary step for further post-search interrogation of the spectra for complete localization of phosphorylation sites. Our current implementation performs validation of phosphoserine/phosphothreonine containing peptides having 1 or 2 phosphorylation sites from data gathered on an ion trap mass spectrometer. The SVM-based algorithm has been implemented in a software package DeBunker. We illustrate the application of the SVM-based software DeBunker on a large phosphorylation dataset.
In shotgun proteomics, protein identification by tandem mass spectrometry relies on bioinformatics tools. Despite recent improvements in identification algorithms, a significant number of high quality spectra remain unidentified for various reasons. Here we present ScanRanker, an open-source tool that evaluates the quality of tandem mass spectra via sequence tagging with reliable performance in data from different instruments. The superior performance of ScanRanker enables it not only to find unassigned high quality spectra that evade identification through database search, but also to select spectra for de novo sequencing and cross-linking analysis. In addition, we demonstrate that the distribution of ScanRanker scores predicts the richness of identifiable spectra among multiple LC-MS/MS runs in an experiment, and ScanRanker scores assist the process of peptide assignment validation to increase confident spectrum identifications. The source code and executable versions of ScanRanker are available from http://fenchurch.mc.vanderbilt.edu.
spectral quality; sequence tagging; bioinformatics; tandem mass spectrometry; cross-linking
Shotgun proteomics has recently emerged as a powerful approach to characterizing proteomes in biological samples. Its overall objective is to identify the form and quantity of each protein in a high-throughput manner by coupling liquid chromatography with tandem mass spectrometry. As a consequence of its high throughput nature, shotgun proteomics faces challenges with respect to the analysis and interpretation of experimental data. Among such challenges, the identification of proteins present in a sample has been recognized as an important computational task. This task generally consists of (1) assigning experimental tandem mass spectra to peptides derived from a protein database, and (2) mapping assigned peptides to proteins and quantifying the confidence of identified proteins. Protein identification is fundamentally a statistical inference problem with a number of methods proposed to address its challenges. In this review we categorize current approaches into rule-based, combinatorial optimization and probabilistic inference techniques, and present them using integer programming and Bayesian inference frameworks. We also discuss the main challenges of protein identification and propose potential solutions with the goal of spurring innovative research in this area.
Automated database search engines are among the fundamental tools of high-throughput proteomics, enabling daily identification of hundreds of thousands of peptides and proteins from tandem mass spectrometry (MS/MS) data. Nevertheless, this automation also makes it impossible to manually validate the vast lists of identifications resulting from such high-throughput searches. This challenge is usually addressed by using a Target-Decoy Approach (TDA) to impose an empirical False Discovery Rate (FDR) at a pre-determined threshold x%, with the expectation that at most x% of the returned identifications would be false positives. But despite the fundamental importance of FDR estimates in ensuring the utility of large lists of identifications, there is surprisingly little consensus on exactly how TDA should be applied to minimize the chances of biased FDR estimates. In fact, since less rigorous TDA/FDR estimates tend to result in more identifications (at higher 'true' FDR), there is often little incentive to enforce strict TDA/FDR procedures in studies where the major metric of success is the size of the list of identifications and there are no follow-up studies imposing hard cost constraints on the number of reported false positives.
Here we address the problem of the accuracy of TDA estimates of empirical FDR. Using MS/MS spectra from samples where we were able to define a factual estimator of 'true' FDR, we evaluate several popular variants of the TDA procedure in a variety of database search contexts. We show that the fraction of false identifications can sometimes be over 10× higher than reported and may be unavoidably high for certain types of searches. In addition, we report that the two-pass search strategy seems the most promising database search strategy.
While unavoidably constrained by the particulars of any specific evaluation dataset, our observations support a series of recommendations towards maximizing the number of resulting identifications while controlling database searches with robust and reproducible TDA estimation of empirical FDR.
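The basic target-decoy estimate discussed above can be sketched as follows. This is a minimal illustration of one common TDA variant, not the procedure evaluated or recommended in the abstract: it assumes a search against targets and an equal-sized decoy set, higher scores being better, no tied scores, and FDR estimated as the number of decoy hits divided by the number of target hits above a score threshold. The function name `tda_fdr_threshold` and the example scores are hypothetical.

```python
from typing import List, Tuple


def tda_fdr_threshold(
    psms: List[Tuple[float, bool]], max_fdr: float
) -> Tuple[float, int]:
    """Find the most permissive score cutoff whose estimated FDR <= max_fdr.

    psms: (score, is_decoy) pairs; higher score means a better match.
    Returns (score_threshold, number_of_accepted_target_psms).
    Assumption: FDR is estimated as decoys / targets at or above the cutoff.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = 0
    decoys = 0
    best_threshold = float("inf")  # accept nothing if no cutoff qualifies
    best_targets = 0
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Record the lowest score at which the estimate is still acceptable.
        if targets > 0 and decoys / targets <= max_fdr:
            best_threshold = score
            best_targets = targets
    return best_threshold, best_targets


# Hypothetical PSM list: True marks a decoy hit.
example_psms = [
    (10.0, False), (9.0, False), (8.0, False), (7.0, True),
    (6.0, False), (5.0, False), (4.0, True), (3.0, False),
]
threshold, accepted = tda_fdr_threshold(example_psms, max_fdr=0.25)
```

Other variants differ in how the decoy count is scaled and in whether targets and decoys are searched separately or together, and these choices affect how biased the resulting estimate is.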
Mass spectrometry-based analyses of post-translational modifications commonly report thousands of modified-peptide identifications accompanied by both precisely and ambiguously localized modification sites. Since these identifications often motivate extensive follow-up studies, the confident identification of the peptide and accurate localization of the modification site(s) remain major challenges in computational proteomics. As revealed by the 2010 iPRG study on identification of phosphopeptides and localization of phosphorylation sites, participants only attempted to call the modification sites for less than 2 out of every 3 identified spectra and actually disagreed on over 20% of all cases where at least two participants called a modification site. In this talk we will cover current and novel methods for identification of post-translationally modified peptides and automated determination of site localization confidence scores and false discovery rates.
Confident identification of peptides via tandem mass spectrometry underpins modern high-throughput proteomics. This has motivated considerable recent interest in the post-processing of search engine results to increase confidence and calculate robust statistical measures, for example through the use of decoy databases to calculate false discovery rates (FDR). FDR-based analyses allow for multiple testing and can assign a single confidence value for both sets and individual peptide-spectrum matches (PSMs). We recently developed an algorithm for combining the results from multiple search engines, integrating FDRs for sets of PSMs made by different search engine combinations. Here we describe a web server and a downloadable application that make this routinely available to the proteomics community. The web server offers a range of outputs, including informative graphics to assess the confidence of the PSMs and any potential biases. The underlying pipeline provides a basic protein inference step, integrating PSMs into protein ambiguity groups where peptides can be matched to more than one protein. Importantly, we have also implemented full support for the mzIdentML data standard, recently released by the Proteomics Standards Initiative, providing users with the ability to convert native formats to mzIdentML files, which are available to download.
bioinformatics; false discovery rate; multiple search engines; web server; data standards