Shotgun proteomics experiments are dependent upon database search engines to identify peptides from tandem mass spectra. Many of these algorithms score potential identifications by evaluating the number of fragment ions matched between each peptide sequence and an observed spectrum. These systems, however, generally do not distinguish between matching an intense peak and matching a minor peak. We have developed a statistical model to score peptide matches that is based upon the multivariate hypergeometric distribution. This scorer, part of the “MyriMatch” database search engine, places greater emphasis on matching intense peaks. The probability that the best match for each spectrum has occurred by random chance can be employed to separate correct matches from random ones. We evaluated this software on data sets from three different laboratories employing three different ion trap instruments. Employing a novel system for testing discrimination, we demonstrate that stratifying peaks into multiple intensity classes improves the discrimination of scoring. We compare MyriMatch results to those of Sequest and X!Tandem, revealing that it is capable of higher discrimination than either of these algorithms. When minimal peak filtering is employed, performance plummets for a scoring model that does not stratify matched peaks by intensity. On the other hand, we find that MyriMatch discrimination improves as more peaks are retained in each spectrum. MyriMatch also scales well to tandem mass spectra from high-resolution mass analyzers. These findings may indicate limitations for existing database search scorers that count matched peaks without differentiating them by intensity. This software and source code is available under Mozilla Public License at this URL: http://www.mc.vanderbilt.edu/msrc/bioinformatics/.
Proteomics; Identification; Statistical Distribution; Reversed Database; Peak Filtering
Database-searching programs generally identify only a fraction of the spectra acquired in a standard LC/MS/MS study of digested proteins. Subtle variations in database-searching algorithms of MS/MS spectra have been known to provide different identification results. To leverage this variation, we developed Scaffold to probabilistically combine the results of multiple search engines, including Sequest, Mascot, and X!Tandem. Here we present a “tell all” explanation of the specific methodology behind Scaffold that converts scores into search engine independent peptide probabilities. These probabilities can be readily combined across search engines using Bayesian rules and the Expectation Maximization learning algorithm. We demonstrate how we normally gain 20% to 100% more highly confident (>95%) MS/MS spectrum identifications with each additional search engine, which is primarily due to increased confidence in low-scoring matches. We also show that this method works reliably across a variety of search engines and instrumentation platforms without re-tuning.
In shotgun proteomics, protein identification by tandem mass spectrometry relies on bioinformatics tools. Despite recent improvements in identification algorithms, a significant number of high quality spectra remain unidentified for various reasons. Here we present ScanRanker, an open-source tool that evaluates the quality of tandem mass spectra via sequence tagging with reliable performance in data from different instruments. The superior performance of ScanRanker enables it not only to find unassigned high quality spectra that evade identification through database search, but also to select spectra for de novo sequencing and cross-linking analysis. In addition, we demonstrate that the distribution of ScanRanker scores predicts the richness of identifiable spectra among multiple LC-MS/MS runs in an experiment, and ScanRanker scores assist the process of peptide assignment validation to increase confident spectrum identifications. The source code and executable versions of ScanRanker are available from http://fenchurch.mc.vanderbilt.edu.
spectral quality; sequence tagging; bioinformatics; tandem mass spectrometry; cross-linking
Shotgun proteomics coupled with database search software allows the identification of a large number of peptides in a single experiment. However, some existing search algorithms, such as SEQUEST, use score functions that are designed primarily to identify the best peptide for a given spectrum. Consequently, when comparing identifications across spectra, the SEQUEST score function Xcorr fails to discriminate accurately between correct and incorrect peptide identifications. Several machine learning methods have been proposed to address the resulting classification task of distinguishing between correct and incorrect peptide-spectrum matches (PSMs). A recent example is Percolator, which uses semi-supervised learning and a decoy database search strategy to learn to distinguish between correct and incorrect PSMs identified by a database search algorithm. The current work describes three improvements to Percolator. (1) Percolator’s heuristic optimization is replaced with a clear objective function, with intuitive reasons behind its choice. (2) Tractable nonlinear models are used instead of linear models, leading to improved accuracy over the original Percolator. (3) A method, Q-ranker, for directly optimizing the number of identified spectra at a specified q value is proposed, which achieves further gains.
shotgun proteomics; tandem mass spectrometry; machine learning; peptide identification
Peptide characterization using electron transfer dissociation (ETD) is an important analytical tool for protein identification. The fragmentation observed in ETD spectra is complementary to that seen when using the traditional dissociation method, collision activated dissociation (CAD). Applications of ETD enhance the scope and complexity of the peptides that can be studied by mass spectrometry-based methods. For example, ETD is shown to be particularly useful for the study of post-translationally modified peptides.
To take advantage of the power provided by ETD, it is important to have an ETD-specific database search engine - an integral tool of mass spectrometry-based analytical proteomics. In this paper, we report on our development of a database search engine using ETD spectra and protein sequence databases to identify peptides. The search engine is based on the probabilistic modeling of shared peaks count and shared peaks intensity between the spectra and the peptide sequences. The shared peaks count accounts for the cumulative variations from amino acid sequences, while shared peaks intensity models the variations between the candidate sequence and product ion intensities. To demonstrate the utility of this algorithm for searching real-world data, we present the results of applications of this model to two high throughput data sets. Both data sets were obtained from yeast whole cell lysates. The first data set was obtained from a sample digested by Lys-C and the second data set was obtained by a digestion using trypsin. We searched the data sets against a combined forward and reversed yeast protein database to estimate false discovery rates. We compare the search results from the new methods with the results from a search engine often employed for ETD spectra, OMSSA. Our findings show that overall the new model performs comparably to OMSSA for low false discovery rates. At the same time, we demonstrate that there are substantial differences with OMSSA for results on subsets of data. Therefore, we conclude the new model can be considered as being complementary to previously developed models.
We also analyze the effect of the precursor mass accuracy on the false discovery rates of peptide identifications. It is shown that a substantial (30%) improvement on false discovery rates is achieved by the use of the mass accuracy information in combination with the database search results.
Tandem mass spectrometry-based shotgun proteomics has become a widespread technology for analyzing complex protein mixtures. A number of database searching algorithms have been developed to assign peptide sequences to tandem mass spectra. Assembling the peptide identifications to proteins, however, is a challenging issue because many peptides are shared among multiple proteins. IDPicker is an open-source protein assembly tool that derives a minimum protein list from peptide identifications filtered to a specified False Discovery Rate. Here, we update IDPicker to increase confident peptide identifications by combining multiple scores produced by database search tools. By segregating peptide identifications for thresholding using both the precursor charge state and the number of tryptic termini, IDPicker retrieves more peptides for protein assembly. The new version is more robust against false positive proteins, especially in searches using multispecies databases, by requiring additional novel peptides in the parsimony process. IDPicker has been designed for incorporation in many identification workflows by the addition of a graphical user interface and the ability to read identifications from the pepXML format. These advances position IDPicker for high peptide discrimination and reliable protein assembly in large-scale proteomics studies. The source code and binaries for the latest version of IDPicker are available from http://fenchurch.mc.vanderbilt.edu/.
bioinformatics; parsimony; protein assembly; protein inference; false discovery rate
Obtaining accurate peptide identifications from shotgun proteomics liquid chromatography tandem mass spectrometry (LC-MS/MS) experiments requires a score function that consistently ranks correct peptide-spectrum matches (PSMs) above incorrect matches. We have observed that, for the Sequest score function X corr, the inability to discriminate between correct and incorrect PSMs is due in part to spectrum-specific properties of the score distribution. In other words, some spectra score well regardless of which peptides they are scored against, and other spectra score well because they are scored against a large number of peptides. We describe a protocol for calibrating PSM score functions, and we demonstrate its application to X corr and the preliminary Sequest score function Sp. The protocol accounts for spectrum- and peptide-specific effects by calculating p values for each spectrum individually, using only that spectrum’s score distribution. We demonstrate that these calculated p values are uniform under a null distribution and therefore accurately measure significance. These p values can be used to estimate the false discovery rate, therefore eliminating the need for an extra search against a decoy database. In addition, we show that the p values are better calibrated than their underlying scores; consequently, when ranking top-scoring PSMs from multiple spectra, p values are better at discriminating between correct and incorrect PSMs. The calibration protocol is generally applicable to any PSM score function for which an appopriate parametric family can be identified.
calibration; database search; peptide identification; tandem mass spectrometry
Tandem mass spectrometry has become a remarkably powerful technology to identify proteins in proteomics. Bioinformatics tools, especially database searching tools, are essential for the interpretation of large quantities of proteomics data. Despite recent improvements in database searching algorithms, only a relatively small fraction of spectra can be confidently assigned to peptide sequences in a typical proteomics analysis. The remaining unassigned spectra often consist of low quality spectra that cause a significant amount of computational overhead but that contribute little to protein identification. On the other hand, many high quality spectra remain unassigned due to modifications, mutations, and the deficiencies of the scoring methods implemented in database searching tools. Here we present ScanRanker, an open-source algorithm that offers a robust method for spectral quality assessment. Unlike existing tools that require training software for each type of instrument to be employed, ScanRanker evaluates quality of tandem mass spectra via sequence tagging, providing reliable performance in data sets from different instruments. The superior performance of ScanRanker enables it not only to filter low quality spectra prior to database searching, but also to find unassigned high quality spectra that evade identification through database search.
The target-decoy approach to estimating and controlling false discovery rate (FDR) has become a de facto standard in shotgun proteomics, and it has been applied at both the peptide-to-spectrum match (PSM) and protein levels. Current bioinformatics methods control either the PSM- or the protein-level FDR, but not both. In order to obtain the most reliable information from their data, users must employ one method when the number of tandem mass spectra exceeds the number of proteins in the database and another method when the reverse is true. Here we propose a simple variation of the standard target-decoy strategy that estimates and controls PSM and protein FDRs simultaneously, regardless of the relative numbers of spectra and proteins. We demonstrate that even if the final goal is a list of PSMs with a fixed low FDR and not a list of protein identifications, the proposed two-dimensional strategy offers advantages over a pure PSM-level strategy.
Mass spectrometry; target decoy strategy; false discovery rate; peptide identification; protein identification
Motivation: A mass spectrum produced via tandem mass spectrometry can be tentatively matched to a peptide sequence via database search. Here, we address the problem of assigning a posterior error probability (PEP) to a given peptide-spectrum match (PSM). This problem is considerably more difficult than the related problem of estimating the error rate associated with a large collection of PSMs. Existing methods for estimating PEPs rely on a parametric or semiparametric model of the underlying score distribution.
Results: We demonstrate how to apply non-parametric logistic regression to this problem. The method makes no explicit assumptions about the form of the underlying score distribution; instead, the method relies upon decoy PSMs, produced by searching the spectra against a decoy sequence database, to provide a model of the null score distribution. We show that our non-parametric logistic regression method produces accurate PEP estimates for six different commonly used PSM score functions. In particular, the estimates produced by our method are comparable in accuracy to those of PeptideProphet, which uses a parametric or semiparametric model designed specifically to work with SEQUEST. The advantage of the non-parametric approach is applicability and robustness to new score functions and new types of data.
Availability: C++ code implementing the method as well as supplementary information is available at http://noble.gs.washington.edu/proj/qvality
The key to mass-spectrometry-based proteomics is peptide identification, which relies on software analysis of tandem mass spectra. Although each search engine has its strength, combining the strengths of various search engines is not yet realizable largely due to the lack of a unified statistical framework that is applicable to any method.
We have developed a universal scheme for statistical calibration of peptide identifications. The protocol can be used for both de novo approaches as well as database search methods. We demonstrate the protocol using only the database search methods. Among seven methods -SEQUEST (v27 rev12), ProbID (v1.0), InsPecT (v20060505), Mascot (v2.1), X!Tandem (v1.0), OMSSA (v2.0) and RAId_DbS – calibrated, except for X!Tandem and RAId_DbS most methods require a rescaling according to the database size searched. We demonstrate that our calibration protocol indeed produces unified statistics both in terms of average number of false positives and in terms of the probability for a peptide hit to be a true positive. Although both the protocols for calibration and the statistics thus calibrated are universal, the calibration formulas obtained from one laboratory with data collected using either centroid or profile format may not be directly usable by the other laboratories. Thus each laboratory is encouraged to calibrate the search methods it intends to use. We also address the importance of using spectrum-specific statistics and possible improvement on the current calibration protocol. The spectra used for statistical (E-value) calibration are freely available upon request.
Open peer review
Reviewed by Dongxiao Zhu (nominated by Arcady Mushegian), Alexey Nesvizhskii (nominated by King Jordan) and Vineet Bafna. For the full reviews, please go to the Reviewers' comments section.
Spectral libraries have emerged as a viable alternative to protein sequence databases for peptide identification. These libraries contain previously detected peptide sequences and their corresponding tandem mass spectra (MS/MS). Search engines can then identify peptides by comparing experimental MS/MS scans to those in the library. Many of these algorithms employ the dot product score for measuring the quality of a spectrum-spectrum match (SSM). This scoring system does not offer a clear statistical interpretation and ignores fragment ion m/z discrepancies in the scoring. We developed a new spectral library search engine, Pepitome, which employs statistical systems for scoring SSMs. Pepitome outperformed the leading library search tool, SpectraST, when analyzing data sets acquired on three different mass spectrometry platforms. We characterized the reliability of spectral library searches by confirming shotgun proteomics identifications through RNA-Seq data. Applying spectral library and database searches on the same sample revealed their complementary nature. Pepitome identifications enabled the automation of quality analysis and quality control (QA/QC) for shotgun proteomics data acquisition pipelines.
Tandem mass spectrometry, run in combination with liquid chromatography (LC-MS/MS), can generate large numbers of peptide and protein identifications, for which a variety of database search engines are available. Distinguishing correct identifications from false positives is far from trivial because all data sets are noisy, and tend to be too large for manual inspection, therefore probabilistic methods must be employed to balance the trade-off between sensitivity and specificity. Decoy databases are becoming widely used to place statistical confidence in results sets, allowing the false discovery rate (FDR) to be estimated. It has previously been demonstrated that different MS search engines produce different peptide identification sets, and as such, employing more than one search engine could result in an increased number of peptides being identified. However, such efforts are hindered by the lack of a single scoring framework employed by all search engines.
We have developed a search engine independent scoring framework based on FDR which allows peptide identifications from different search engines to be combined, called the FDRScore. We observe that peptide identifications made by three search engines are infrequently false positives, and identifications made by only a single search engine, even with a strong score from the source search engine, are significantly more likely to be false positives. We have developed a second score based on the FDR within peptide identifications grouped according to the set of search engines that have made the identification, called the combined FDRScore. We demonstrate by searching large publicly available data sets that the combined FDRScore can differentiate between between correct and incorrect peptide identifications with high accuracy, allowing on average 35% more peptide identifications to be made at a fixed FDR than using a single search engine.
proteomics; mass spectrometry; decoy database; search engine; scoring; false discovery rate
A key problem in computational proteomics is distinguishing between correct and false peptide identifications. We argue that evaluating the error rates of peptide identifications is not unlike computing generating functions in combinatorics. We show that the generating functions and their derivatives (spectral energy and spectral probability) represent new features of tandem mass spectra that, similarly to Δ-scores, significantly improve peptide identifications. Furthermore, the spectral probability provides a rigorous solution to the problem of computing statistical significance of spectral identifications. The spectral energy/probability approach improves the sensitivity-specificity trade-off of existing MS/MS search tools, addresses the notoriously difficult problem of “one-hit-wonders” in mass spectrometry, and often eliminates the need for decoy database searches. We therefore argue that the generating function approach has the potential to increase the number of peptide identifications in MS/MS searches.
The shotgun strategy (liquid chromatography coupled with tandem mass spectrometry) is widely applied for identification of proteins in complex mixtures. This method gives rise to thousands of spectra in a single run, which are interpreted by computational tools. Such tools normally use a protein database from which peptide sequences are extracted for matching with experimentally derived mass spectral data. After the database search, the correctness of obtained peptide-spectrum matches (PSMs) needs to be evaluated also by algorithms, as a manual curation of these huge datasets would be impractical. The target-decoy database strategy is largely used to perform spectrum evaluation. Nonetheless, this method has been applied without considering sensitivity, i.e., only error estimation is taken into account. A recently proposed method termed MUDE treats the target-decoy analysis as an optimization problem, where sensitivity is maximized. This method demonstrates a significant increase in the retrieved number of PSMs for a fixed error rate. However, the MUDE model is constructed in such a way that linear decision boundaries are established to separate correct from incorrect PSMs. Besides, the described heuristic for solving the optimization problem has to be executed many times to achieve a significant augmentation in sensitivity.
Here, we propose a new method, termed MUMAL, for PSM assessment that is based on machine learning techniques. Our method can establish nonlinear decision boundaries, leading to a higher chance to retrieve more true positives. Furthermore, we need few iterations to achieve high sensitivities, strikingly shortening the running time of the whole process. Experiments show that our method achieves a considerably higher number of PSMs compared with standard tools such as MUDE, PeptideProphet, and typical target-decoy approaches.
Our approach not only enhances the computational performance, and thus the turn around time of MS-based experiments in proteomics, but also improves the information content with benefits of a higher proteome coverage. This improvement, for instance, increases the chance to identify important drug targets or biomarkers for drug development or molecular diagnostics.
Machine learning; Bioinformatics; Peptide/protein identification; Shotgun proteomics; Phosphoproteomics; Tandem mass spectrometry
We developed an informatic method to identify tandem mass spectra composed of chemically cross-linked peptides from those of linear peptides and to assign sequence to each of the two unique peptide sequences. For a given set of proteins the key software tool, xComb, combs through all theoretically feasible cross-linked peptides to create a database consisting of a subset of all combinations represented as peptide FASTA files. The xComb library of select theoretical cross-linked peptides may then be used as a database that is examined by a standard proteomic search engine to match tandem mass spectral datasets to identify cross-linked peptides. The database search may be conducted against as many as 50 proteins with a number of common proteomic search engines, e.g. Phenyx, Sequest, OMSSA, Mascot and X!Tandem. By searching against a peptide library of linearized, cross-linked peptides, rather than a linearized protein library, search times are decreased and the process is decoupled from any specific search engine. A further benefit of decoupling from the search engine is that protein cross-linking studies may be conducted with readily available informatics tools for which scoring routines already exist within the proteomic community.
The main challenge of tandem mass spectrometry based proteomic analysis is to correctly match the tandem mass spectra produced to the correct peptides. However, the large number of protein sequences in a database increases the chances of a false positive identification for any given peptide match. Here we present an automated algorithm called IDSieve that utilizes target-decoy database search strategy in combination with pI filtering to allow greater confidence for peptide identifications. IDSieve considers the SEQUEST parameters Xcorr and äCn to assign statistical confidence (false discovery rates) to the peptide matches. The distribution of predicted pI values for peptide spectrum matches (PSMs) is considered separately for each immobilized pH gradient isoelectric focusing fraction, and matches with pI values within 1.5 times inter-quartile range (within pI range) are analyzed independently of matches outside the pI ranges. We tested the performance of IDSieve and Peptide/Protein Prophet on the SEQUEST outputs from 60 immobilized pH gradient isoelectric focusing fractions derived from mouse intestinal epithelial cell protein extracts. Our results demonstrated that IDSieve produced 1355 more peptide spectrum matches (or 330 more peptides) than Peptide Prophet using comparable false positive rate cutoffs. Therefore, combining pI filtering with the appropriate statistical significance measurements allows for a higher number of protein identifications without adversely affecting the false positive rate. We further tested the performance of pI filtering using ID Sieve when samples were prefractionated using either pH range 3.5–4.5 or 3–10, and either 24cm or 7cm IPG strips.
Tandem mass spectrometry has emerged as a cornerstone of high throughput proteomic studies owing in part to various high throughput search engines which are used to interpret these tandem mass spectra. However, majority of experimental tandem mass spectra cannot be interpreted by any existing methods. There are many reasons why this happens. However, one of the most important reasons is that majority of experimental spectra are of too poor quality to be interpretable. It wastes time to interpret these "uninterpretable" spectra by any methods. On the other hand, some spectra of high quality are not able to get a score high enough to be interpreted by existing search engines because there are many similar peptides in the searched database. However, such spectra may be good enough to be interpreted by de novo methods or manually verifying methods. Therefore, it is worth in developing a method for assessing spectral quality, which can used for filtering the spectra of poor quality before any interpretation attempts or for finding the most potential candidates for de novo methods or manually verifying methods.
This paper develops a novel method to assess the quality of tandem mass spectra, which can eliminate majority of poor quality spectra while losing very minority of high quality spectra. First, a number of features are proposed to describe the quality of tandem mass spectra. The proposed method maps each tandem spectrum into a feature vector. Then Fisher linear discriminant analysis (FLDA) is employed to construct the classifier (the filter) which discriminates the high quality spectra from the poor quality ones. The proposed method has been tested on two tandem mass spectra datasets acquired by ion trap mass spectrometers.
Computational experiments illustrate that the proposed method outperforms the existing ones. The proposed method is generic, and is expected to be applicable to assessing the quality of spectra acquired by instruments other than ion trap mass spectrometers.
Peptide identification by tandem mass spectrometry is the dominant proteomics workflow for protein characterization in complex samples. Traditional search engines, which match peptide sequences with tandem mass spectra to identify the samples' proteins, use protein sequence databases to suggest peptide candidates for consideration. Although the acquisition of tandem mass spectra is not biased toward well-understood protein isoforms, this computational strategy is failing to identify peptides from alternative splicing and coding SNP protein isoforms despite the acquisition of good-quality tandem mass spectra. We propose, instead, that expressed sequence tags (ESTs) be searched. Ordinarily, such a strategy would be computationally infeasible due to the size of EST sequence databases; however, we show that a sophisticated sequence database compression strategy, applied to human ESTs, reduces the sequence database size approximately 35-fold. Once compressed, our EST sequence database is comparable in size to other commonly used protein sequence databases, making routine EST searching feasible. We demonstrate that our EST sequence database enables the discovery of novel peptides in a variety of public data sets.
alternative splicing; expressed sequence tags; proteomics; sequence compression; tandem mass spectra
Recently there has been an increasing interest in using spectral searching as an alternative to traditional database sequence searching methods for peptide identification from tandem mass spectrometry. In spectral searching, the query spectrum is compared to a carefully compiled library of previously observed and identified spectra; high spectral similarity signals positive identification. We have previously developed an open-source software toolkit, SpectraST, to enable proteomics researchers to integrate spectral searching into their data analysis pipeline. Here we report an additional module to SpectraST that provides the functionality of spectral library building, allowing users to build custom libraries when public spectral libraries do not adequately meet their needs. A consensus creation algorithm was developed to coalesce replicate spectra identified to the same peptide ion. Various quality filters were implemented to remove questionable and low-quality spectra from the library. To validate the methodology, we first compiled a spectral library from the 1.3 million SEQUEST-identified spectra (29,109 distinct peptide ions) among the publicly released datasets in the Human Plasma PeptideAtlas, a collection of 40 contributed, heterogeneous shotgun proteomics datasets, and verified the effectiveness of the library building algorithm to generate high-quality, representative consensus spectra and to remove questionable spectra. We then re-searched the same datasets by SpectraST against this spectral library filtered at different quality levels, and used the performance as a benchmark to evaluate our library building methods and to determine key parameters for high-quality library building. We demonstrated the importance of library quality on the performance of spectral searching. The ready-to-deploy software allows individual researchers to easily condense their raw data into specialized spectral libraries, summarizing useful information about their observed proteomes into a concise and retrievable format for future data analyses.
In shotgun proteomics, the analysis of tandem mass spectrometry data from peptides can benefit greatly from high mass accuracy measurements. In this study we have evaluated two database search strategies which use high mass accuracy measurements of the peptide precursor ion. Our results indicate that peptide identifications are improved when spectra are searched with a wide mass tolerance window and precursor mass is used as a filter to discard incorrect matches. Database searches with a peptide dataset constrained to peptides within a narrow mass window resulted in fewer peptide identifications but a significantly faster database search time.
Proteomics; Database search; Precursor estimation; Mass accuracy
The analysis of the large volume of tandem mass spectrometry (MS/MS) proteomics data that is generated these days relies on automated algorithms that identify peptides from their mass spectra. An essential component of these algorithms is the scoring function used to evaluate the quality of peptide-spectrum matches (PSMs). In this paper, we present new approach to scoring of PSMs. We argue that since this problem is at its core a ranking task (especially in the case of de novo sequencing), it can be solved effectively using machine learning ranking algorithms. We developed a new discriminative boosting-based approach to scoring. Our scoring models draw upon a large set of diverse feature functions that measure different qualities of PSMs. Our method improves the performance of our de novo sequencing algorithm beyond the current state-of-the-art, and also greatly enhances the performance of database search programs. Furthermore, by increasing the efficiency of tag filtration and improving the sensitivity of PSM scoring, we make it practical to perform large-scale MS/MS analysis, such as proteogenomic search of a six-frame translation of the human genome (in which we achieve a reduction of the running time by a factor of 15 and a 60% increase in the number of identified peptides, compared to the InsPecT database search tool). Our scoring function is incorporated into PepNovo+ which is available for download or can be run online at: http://bix.ucsd.edu.
MS/MS; scoring; peptide; PSM; de novo; database search; machine learning; ranking; boosting
Two new statistical models based on Monte Carlo Simulation (MCS) have been developed to score peptide matches in shotgun proteomic data and incorporated in a database search program, MassMatrix (www.massmatrix.net). The first model evaluates peptide matches based on the total abundance of matched peaks in the experimental spectra. The second model evaluates amino acid residue tags within MS/MS spectra. The two models provide complementary scores for peptide matches that result in higher confidence in peptide identification when significant scores are returned from both models. The MCS-based models use a variance reduction technique that improves estimation precision. Due to the high computational expense of MCS-based models, peptide matches were prefiltered by other statistical models before further evaluation by the MCS-based models. Receiver operating characteristic analysis of the data sets confirmed that MCS-based models improved the overall performance of the MassMatrix search software, especially for low-mass accuracy data sets.
shotgun proteomics; tandem mass spectrometry; database search; Monte Carlo Simulation; amino acid residue tag
Peptide identification from tandem mass spectrometry (MS/MS) data is one of the most important problems in computational proteomics. This technique relies heavily on the accurate assessment of the quality of peptide-spectrum matches (PSMs). However, current MS technology and PSM scoring algorithm are far from perfect, leading to the generation of incorrect peptide-spectrum pairs. Thus, it is critical to develop new post-processing techniques that can distinguish true identifications from false identifications effectively.
In this paper, we present a consistency-based PSM re-ranking method to improve the initial identification results. This method uses one additional assumption that two peptides belonging to the same protein should be correlated to each other. We formulate an optimization problem that embraces two objectives through regularization: the smoothing consistency among scores of correlated peptides and the fitting consistency between new scores and initial scores. This optimization problem can be solved analytically. The experimental study on several real MS/MS data sets shows that this re-ranking method improves the identification performance.
The score regularization method can be used as a general post-processing step for improving peptide identifications. Source codes and data sets are available at: http://bioinformatics.ust.hk/SRPI.rar.
Peptide identification using tandem mass spectrometry is a core technology in proteomics. Latest generations of mass spectrometry instruments enable the use of electron transfer dissociation (ETD) to complement collision induced dissociation (CID) for peptide fragmentation. However, a critical limitation to the use of ETD has been optimal database search software. Percolator is a post-search algorithm, which uses semi-supervised machine learning to improve the rate of peptide spectrum identifications (PSMs) together with providing reliable significance measures. We have previously interfaced the Mascot search engine with Percolator and demonstrated sensitivity and specificity benefits with CID data. Here, we report recent developments in the Mascot Percolator V2.0 software including an improved feature calculator and support for a wider range of ion series. The updated software is applied to the analysis of several CID and ETD fragmented peptide data sets. This version of Mascot Percolator increases the number of CID PSMs by up to 80% and ETD PSMs by up to 60% at a 0.01 q-value (1% false discovery rate) threshold over a standard Mascot search, notably recovering PSMs from high charge state precursor ions. The greatly increased number of PSMs and peptide coverage afforded by Mascot Percolator has enabled a fuller assessment of CID/ETD complementarity to be performed. Using a data set of CID and ETcaD spectral pairs, we find that at a 1% false discovery rate, the overlap in peptide identifications by CID and ETD is 83%, which is significantly higher than that obtained using either stand-alone Mascot (69%) or OMSSA (39%). We conclude that Mascot Percolator is a highly sensitive and accurate post-search algorithm for peptide identification and allows direct comparison of peptide identifications using multiple alternative fragmentation techniques.