Tandem mass spectrometry, run in combination with liquid chromatography (LC-MS/MS), can generate large numbers of peptide and protein identifications, for which a variety of database search engines are available. Distinguishing correct identifications from false positives is far from trivial because all data sets are noisy and tend to be too large for manual inspection; probabilistic methods must therefore be employed to balance the trade-off between sensitivity and specificity. Decoy databases are becoming widely used to place statistical confidence in results sets, allowing the false discovery rate (FDR) to be estimated. It has previously been demonstrated that different MS search engines produce different peptide identification sets, and as such, employing more than one search engine could result in an increased number of peptides being identified. However, such efforts are hindered by the lack of a single scoring framework employed by all search engines.
We have developed a search engine independent scoring framework based on FDR which allows peptide identifications from different search engines to be combined, called the FDRScore. We observe that peptide identifications made by three search engines are infrequently false positives, and identifications made by only a single search engine, even with a strong score from the source search engine, are significantly more likely to be false positives. We have developed a second score based on the FDR within peptide identifications grouped according to the set of search engines that have made the identification, called the combined FDRScore. We demonstrate by searching large publicly available data sets that the combined FDRScore can differentiate between correct and incorrect peptide identifications with high accuracy, allowing on average 35% more peptide identifications to be made at a fixed FDR than using a single search engine.
High-throughput proteome analyses are now commonplace, allowing researchers to assess the proteins present in a sample and, by utilising new technologies, to quantify protein abundance on a large scale. The high-throughput methods can generate large volumes of data for which manual verification of peptide and protein identification is not feasible, so automated methods are required to make correct identifications. It is not yet clear how best to determine which peptide or protein identifications are correct, and how to optimise identification pipelines such that false discovery is kept sufficiently low while maximising the number of proteins that can be identified correctly1.
There are a number of software applications, both commercial and open-source, for identifying peptides from mass spectra2-6. Each application produces a set of non-standard, algorithm-dependent measures of the quality of peptide and protein identifications. Several search engines produce an expectation value (e-value), which relates to the likelihood of a peptide identification having been made by chance. However, it has recently been demonstrated that e-values are not comparable between different packages7. Without a search engine independent measure, it is difficult to optimise identification pipelines, and researchers are likely to set stringent thresholds (often with a limited understanding of the underlying statistical model) to ensure that the rate of false positives is acceptably low. One approach for validating identifications involves the use of a decoy database, which allows statistical confidence to be placed on an identification set by showing the rate of hits to decoy sequences (such as reversed or randomised protein sequences), from which an estimate of the number of false positives can be calculated for a given threshold8.
It has been demonstrated that different software packages do not produce the same peptide identifications for large sets of spectra7, particularly for peptides scoring close to the threshold for acceptance or rejection. This means that it should be possible to extract more identifications from a set of spectra by employing multiple search engines, if there is a framework for combining the identification sets. In this work, we have developed a search engine independent scoring system that assigns a score to each peptide-spectrum match based on the frequency of false discovery for identifications made with such a score or better. The score is similar to a q-value, which has recently been demonstrated for use in proteomics [ref] as a search engine independent scoring system for each peptide-spectrum identification. A q-value is calculated as follows (a graphical example is shown in Figure 3). First, identifications are ordered according to some metric of quality, such as peptide ion score from Mascot or XCorr from Sequest. Second, for each score associated with a peptide-spectrum match, the cumulative FDR is estimated (for example from a decoy database search) that would result if that exact score was set as the threshold for acceptance or rejection of identifications. Third, a q-value is assigned to each match as the minimum FDR at which the identification could be made, i.e. the weakest threshold that could be set to include an identification without increasing the number of false positives (see Methods for the detailed algorithm). Q-values are useful for setting thresholds that guarantee the reported FDR is less than a given threshold. However, q-values are less useful for further calculations for the following reason. Q-values follow a stepwise distribution where all target identifications with no intervening decoy identifications share the same q-value, so relative information about the quality of an identification is lost.
Furthermore, within a set of identifications that share the same q-value, the strongest identification will always be a decoy identification, such that q-values are biased against decoy identifications. Therefore, we have adapted the calculation of q-values such that rather than following the stepwise distribution of q-values, the software uses localised linear regression to estimate the FDR for the target identifications between each decoy identification (see Methods for algorithm); the resulting score is called the FDRScore. Each peptide-spectrum identification can therefore have three related measures: i) the estimated FDR, ii) the q-value and iii) the FDRScore, as demonstrated in Figure 1. Across an entire identification set, all three values are roughly similar; however, on a localised scale the FDRScore is more useful for further calculation, since it maintains the ordering of identification quality (which is lost in both estimated FDR and q-value), and each FDRScore assigned to a target identification is likely to be closer to the actual FDR associated with a peptide-spectrum match than either the estimated FDR or the q-value. Finally, all target identifications scoring higher than the best decoy hit have an estimated FDR and q-value = 0, although their probability of being incorrect is not zero. The regression method used to calculate FDRScores includes the origin (0,0) for the first calculation and hence no identification has an FDRScore = 0.
The assignment of an FDRScore to each identification allows the identification sets produced by different search engines to be compared using a single metric, and enables the combination of the total identification set across all spectra searched with each engine. We have integrated the results from Mascot9, and two open source applications Omssa4 and X!Tandem3, and the process can be followed relatively simply by any laboratory using the search engines that are available to them.
Our results demonstrate that the FDR is far higher for peptides identified by only a single search engine. In contrast, if a peptide has been identified by all three search engines, it is highly unlikely to be a false positive. As such, we have developed a system for calculating a second metric using a similar basis to the FDRScore, called a combined FDRScore, which re-assesses the rates of false discovery for identifications made by only one search engine, by each distinct pair of search engines, or by all three search engines after data have been combined (see Methods for details). The combined FDRScore appears to be a highly effective discriminator between a correct and incorrect peptide identification, and as such, for a fixed false discovery rate, e.g. 1% FDR, on average gains of 35% total peptide identifications are possible over the best individual search engine.
Proteome data sets from the public data repository PeptideAtlas10, shown in Table 1, were downloaded in mzXML format and converted to Mascot Generic format (MGF) using the RAMP parser (http://tools.proteomecenter.org/TPP.php). The majority of the data sets were from yeast (Saccharomyces cerevisiae), selected to cover a range of contributing laboratories, experimental approaches and data set sizes. The data sets were searched using Omssa (version 2.1.1), Mascot (version 2.0) and X!Tandem (version 07-07-01) using a parameter set matching the original search parameters as closely as possible: parent ion tolerance 2Da, fragment ion tolerance 0.8Da and average mass setting. Data sets contained different types of variable and fixed modification, e.g. ICAT, carbamidomethyl of cysteine and oxidation of methionine. The system was also tested using searches of human and mouse data from PeptideAtlas, and validated by data sets released by ABRF (Association of Biomolecular Resource Facilities [ref]). ABRF have generated a standard protein set containing 49 known proteins, which allows the actual false discovery rate to be calculated and compared to the estimates from the decoy search approach. ABRF data sets were searched with parent ion tolerance 1.2Da, fragment ion tolerance 0.6Da and monoisotopic mass, to reflect more closely the settings used by the laboratories that produced the data.
The following databases were searched: Yeast SGD ORFs (http://www.yeastgenome.org/), IPI human and IPI mouse (http://www.ebi.ac.uk/IPI/). ABRF data sets were searched against a database constructed specifically by ABRF containing a combination of UniProt human, SwissProt human and additional contaminant proteins expected to be in the sample [ref]. Decoy databases were created by reversing all the protein sequences, and adding the set of reversed sequences to the standard sequences in the same database file. While there is still some discussion in the field, this method of creating a decoy database has been demonstrated to be a robust model for calculating FDR8. By searching the forward and reverse database simultaneously, standard and decoy sequences can compete equally to be the highest ranking identification for each spectrum, adequately representing how normal false positive identifications are made.
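The decoy construction described above can be illustrated with a short Python sketch that reverses each protein sequence and appends the reversed entries to the forward entries. The function names, the `REV_` identifier prefix and the in-memory (identifier, sequence) representation are assumptions made for this example; they are not taken from the software used in this work.

```python
def add_reversed_decoys(records, decoy_prefix="REV_"):
    """Return the target records followed by one reversed decoy per target.

    records: list of (identifier, sequence) tuples.
    decoy_prefix: hypothetical flag used to recognise decoy hits later.
    """
    decoys = [(decoy_prefix + ident, seq[::-1]) for ident, seq in records]
    return list(records) + decoys


def to_fasta(records, width=60):
    """Format (identifier, sequence) tuples as FASTA text."""
    lines = []
    for ident, seq in records:
        lines.append(">" + ident)
        # Wrap the sequence at a fixed line width.
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Made-up identifier and sequence, for illustration only.
    targets = [("PROT1", "MKTAYIAK")]
    print(to_fasta(add_reversed_decoys(targets)), end="")
```

Writing targets and decoys to the same file is what lets the two compete for the top rank of each spectrum in a single search.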
In the results set, only the top ranking peptide identification for each spectrum is included. We compared the results sets including only the top ranking identification with those including the top three ranking identifications and discovered that there is no significant increase in the number of peptides identified for a fixed FDR (data not shown).
A method has been published by Elias and Gygi for calculating false discovery rate (FDR) using decoy databases [ref]. Assuming a search is made against a database constructed from equal sized target and decoy database, the number of false positive peptide identifications is calculated for a given threshold by doubling the number of hits to the decoy database, following the logic that for every hit to a decoy sequence, there will be a “silent” incorrect hit in the standard database. However, we believe this measure of false discovery rate can be misleading, since it does not reflect the false discovery rate within the targets, which ultimately is the measure that researchers are interested in. It is trivial to remove the decoy identifications from a set of peptide identifications (since they are flagged with a particular identifier). In the remaining set of targets, it can be assumed on average that the number of targets which are false positives is approximately equal to the number of decoy hits that have been removed. The calculated value is different from the measure of FDR using the equation of Elias and Gygi.
Calculation of Elias and Gygi:
|False positives (FP)||= 2 * decoy hits|
|True positives (TP)||= All identifications above threshold - FP|
|False discovery rate (FDR)||= FP / (FP + TP)|
Example for 1000 identifications, 20 decoy identifications.
FP = 40 (2 * 20)
TP = 960 (1000 - 40)
FDR = 4% (40 / 1000)
Calculation used in this work:
|All targets||= Target hits only (above threshold)|
|False positives (FP)||= Decoy hits|
|True positives (TP)||= All targets above threshold - FP|
|False discovery rate (FDR)||= FP / (FP + TP)|
Example for 1000 identifications, 20 decoy identifications.
All targets = 980 (1000 - 20)
FP = 20
TP = 960 (980 - 20)
FDR = 2.04% (20 / 980)
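The two calculations can be compared directly with a few lines of Python. This is only a restatement of the arithmetic in the worked examples; the function names are hypothetical.

```python
def fdr_doubled_decoys(total_hits, decoy_hits):
    """FDR over targets plus decoys, counting each decoy hit twice."""
    return 2 * decoy_hits / total_hits


def fdr_within_targets(total_hits, decoy_hits):
    """FDR within the target hits that remain once decoys are removed."""
    target_hits = total_hits - decoy_hits
    return decoy_hits / target_hits


if __name__ == "__main__":
    print(f"{fdr_doubled_decoys(1000, 20):.2%}")   # doubled-decoy estimate
    print(f"{fdr_within_targets(1000, 20):.2%}")   # estimate within targets
```

For the same 1000 identifications and 20 decoy hits, the first function gives 4% and the second about 2.04%, which is the difference discussed above.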
Mascot, Omssa and X!Tandem each produce an e-value for each identification, which relates to the frequency at which such a match would be expected to be made by chance. Low e-values indicate that an identification is unlikely to have been made by chance. As has been demonstrated previously [ref], the e-values produced by the three search engines are not comparable. Table 2 displays the e-value threshold that produces a 50% FDR for the different test Yeast data sets (all searched against the same database, since e-values are proportional to database size). Clearly, there are significant differences in the threshold at which 50% FDR is achieved in different experiments, and as such an e-value should not be used except to rank the quality of matches within a single experiment for a particular search engine.
Identifications produced by the three search engines are treated independently by the software. For each search engine, the identification set is ordered by increasing e-value (decreasing quality of match). The estimated FDR, q-value and FDRScore are calculated for each peptide-spectrum match as follows (illustrated graphically in Figure 1).
For each set of peptide identifications made by one search engine:
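A minimal Python sketch of this calculation, based on the description given in the Introduction, is shown here. It is not the authors' software. The estimated FDR and q-value follow the definitions stated in the text. The FDRScore step is an assumed form: it interpolates the q-value linearly against e-value between consecutive decoy identifications, starting from the origin (0,0). The `PSM` class and `score_psms` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class PSM:
    evalue: float    # e-value from one search engine; lower is a better match
    is_decoy: bool   # True when the top-ranked hit is a reversed sequence


def score_psms(psms):
    """Return (ordered_psms, estimated_fdr, q_values, fdr_scores), best first."""
    ordered = sorted(psms, key=lambda p: p.evalue)

    # Estimated FDR at each threshold: decoy hits / target hits accepted so far.
    est, decoys, targets = [], 0, 0
    for p in ordered:
        if p.is_decoy:
            decoys += 1
        else:
            targets += 1
        est.append(decoys / targets if targets else 1.0)

    # q-value: minimum estimated FDR at this threshold or any weaker threshold.
    q = est[:]
    for i in range(len(q) - 2, -1, -1):
        q[i] = min(q[i], q[i + 1])

    # FDRScore (ASSUMED form): straight-line interpolation of the q-value
    # against e-value between consecutive decoys, starting from the origin.
    fdr_score = q[:]              # matches after the last decoy keep their q-value
    prev_e, prev_q = 0.0, 0.0     # origin (0, 0)
    pending = []                  # indices waiting for the next decoy
    for i, p in enumerate(ordered):
        pending.append(i)
        if p.is_decoy:
            span = p.evalue - prev_e
            for j in pending:
                frac = (ordered[j].evalue - prev_e) / span if span > 0 else 1.0
                fdr_score[j] = prev_q + frac * (q[i] - prev_q)
            prev_e, prev_q, pending = p.evalue, q[i], []
    return ordered, est, q, fdr_score
```

With this form, targets that outscore the best decoy receive a small non-zero FDRScore, and the ordering given by the e-value is preserved within each q-value step.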
For each peptide-spectrum identification the calculated FDRScore approximates the frequency of false positives that would be observed if a particular score was set as a threshold for an individual search engine. However, we observe that peptide-spectrum matches that are made by all three search engines tested have a far lower actual FDR than the general distribution. Indeed in certain data sets, decoy identifications are never observed in the set of identifications made by all three search engines, even if the individual scores from each search engine are weak. Conversely, the distribution of peptide-spectrum identifications made by only a single search engine shows that the FDR is high, even for identifications with strong scores from the source search engine.
In order to quantify the effect of search engine agreement, all peptide-spectrum identifications are divided into seven sets according to which search engines have identified them (Figure 2): 1) Tandem only, 2) Mascot only, 3) Omssa only, 4) Omssa and Tandem, 5) Mascot and Tandem, 6) Omssa and Mascot, and 7) Mascot, Omssa, and Tandem. The same algorithm as above is used to re-assign FDRScores calculated within each of the seven distinct sets. Instead of identifications being ordered by e-value, they are ordered by the FDRScores calculated in the first stage. In sets 4-6 all peptide-spectrum matches have an FDRScore assigned from each of the two search engines, and in set 7, the peptide-spectrum matches have three FDRScores, one assigned from each search engine. As such, an average FDRScore is assigned to each identification. For the average FDRScore, the geometric mean is used (calculated as the nth root of the product of n numbers) as we found empirically that a geometric mean is a better differentiator between correct and incorrect identifications than an arithmetic mean, since an arithmetic mean can mask the contribution of low FDRScores (data not shown).
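The grouping and averaging step can be sketched in Python as follows. The input layout (one dictionary of FDRScores per search engine, keyed by a spectrum identifier and peptide sequence) and the function name are assumptions made for illustration; treating two engines as agreeing only when they report the same peptide for the same spectrum is likewise an assumption of this sketch.

```python
import math
from collections import defaultdict


def group_by_engine_agreement(per_engine):
    """Group matches by the set of engines that made them.

    per_engine: {engine_name: {(spectrum_id, peptide): fdr_score}}
    Returns {frozenset_of_engines: {(spectrum_id, peptide): average_fdr_score}},
    where the average is the geometric mean of the available FDRScores.
    """
    seen = defaultdict(dict)
    for engine, scores in per_engine.items():
        for psm, score in scores.items():
            seen[psm][engine] = score

    groups = defaultdict(dict)
    for psm, by_engine in seen.items():
        values = list(by_engine.values())
        # Geometric mean: nth root of the product of n FDRScores.
        average = math.prod(values) ** (1.0 / len(values))
        groups[frozenset(by_engine)][psm] = average
    return dict(groups)
```

For example, FDRScores of 0.01 and 0.04 from two engines give a geometric mean of 0.02, compared with an arithmetic mean of 0.025, so the stronger of the two scores carries more weight.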
For each of the seven sets of peptide-spectrum matches, the same algorithm as above is used to re-calculate the FDRScore for each identification, but rather than using the e-value produced by a single search engine to order identifications, identifications are ordered within each set by the average FDRScore. Each identification is assigned a second FDRScore, called the combined FDRScore which reflects the estimated false discovery rate within the set.
For each set of peptide identifications made by a particular combination of search engines:
This process is performed independently for each of the seven sets of identifications and is demonstrated in Figure 3 for an example experiment from PeptideAtlas. We can observe that the peptide-spectrum matches made by only a single search engine have a considerably higher estimated FDR (represented by the combined FDRScore on the y-axis) than those made by a pair of search engines. In the set of identifications made by all three search engines (set 7), the estimated FDR is low, and false positives are rarely observed. In certain data sets, decoy hits are not observed at any score threshold for identifications made by all three search engines. To correct for the size of the result set, an artificial decoy hit is added at the end of each data series, such that no identification has a combined FDRScore = 0.
The combined FDRScore also has the property that it can be used to set a threshold x, returning a set of identifications with FDR ~= x (in practice almost always < x). The seven identification sets are distinct; as such, a final set of peptide identifications is made by accepting all identifications with, say, combined FDRScore < 0.05. Each of the seven sets returns a certain number of identifications with no more than 5% FDR. In practice sets 1-3 (single search engine only) may return few, if any, identifications. This is demonstrated by the dashed line on Figure 3. All identifications with a combined FDRScore below 0.05 would pass the threshold.
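Because the seven sets are distinct, applying the threshold reduces to a filter over each set followed by pooling. A minimal Python sketch, assuming each match is held as a hypothetical (identifier, combined FDRScore, decoy flag) tuple and that decoy hits are discarded from the final list:

```python
def accept_identifications(groups, threshold=0.05):
    """Pool target matches whose combined FDRScore is below the threshold.

    groups: {set_label: [(psm_id, combined_fdr_score, is_decoy), ...]}
    Returns a list of (psm_id, set_label, combined_fdr_score).
    """
    accepted = []
    for label, psms in groups.items():
        for psm_id, score, is_decoy in psms:
            # Keep only target matches below the chosen threshold.
            if score < threshold and not is_decoy:
                accepted.append((psm_id, label, score))
    return accepted
```

A set in which every match scores above the threshold, as may happen for the single-engine sets, simply contributes nothing to the pooled list.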
The advantage of this approach is that it extracts the maximum number of peptide-spectrum matches that have the profile of being “correct” while excluding those that have the profile of being incorrect.
The method outlined above was used to combine results and weight the contribution of different search engines. The number of peptides identified by each search engine with FDR < 0.01 and FDR < 0.05 was calculated, and compared with the number of peptides identified using the combined FDRScore to set the threshold for data combined from the three search engines. Table 3 displays the percentage improvement using combined FDRScores over the best individual search engine (defined as the search engine that returns the highest number of peptide identifications at a particular q-value threshold) in columns 2 and 3. On average the combined scoring method identifies 35% more peptides than the best individual search engine at FDR < 0.01. There is quite a large difference in the percentage gain between the lowest, PA66 (7%), and the highest, PA162 (68%). It is interesting to note that in the experiments where only modest gains in the number of peptides are made (PA66, PA77, PA93 and PA146), X!Tandem performs poorly, identifying only a fraction of the number of peptides identified by Omssa and Mascot (see supplementary data). As such, data are effectively being combined from two search engines only. In other experiments, Omssa, X!Tandem and Mascot all perform similarly well, with either Omssa or X!Tandem identifying the highest number of peptides at 1% FDR.
The Association of Biomolecular Resource Facilities (ABRF) have generated an artificial mixture of 49 known proteins to allow proteome technologies to be validated in laboratories. Data sets generated by several laboratories have been released and are publicly available, although a detailed analysis of the data sets has yet to be published. From the different laboratories, we selected the highest quality data set (laboratory 12874) that was available from ProteomeCommons [ref] to test that the software was correctly differentiating true and false positive peptide identifications. According to preliminary data from ABRF [ref], laboratory 12874 correctly identified 45 of the 49 proteins with 0 false positive identifications.
The data set was searched with the three search engines and combined FDRScores were calculated. At a threshold of combined FDRScore < 0.01, 2451 peptide-spectrum matches were made from 6027 spectra by combining results from the three search engines as outlined above. There were 10 decoy identifications within the list of 2451, which were removed. Of the remaining peptides, 16 could be matched to proteins not expected in the analysis, and the resulting peptide FDR is thus 0.0066 (16/2441). For these 16 peptides, it is not possible to distinguish whether they are false positives caused by the search method or contaminants in the sample. However, it is clear that setting a combined FDRScore < 0.01 results in an acceptably low false positive rate, and that the method does not introduce a major bias to identify targets over decoys.
Since the complete data sets have yet to be published for the ABRF data sets, it is not feasible to make a detailed comparison of the total number of peptide identifications made by our method. However, laboratory 12874 submitted 442 non-redundant peptide identifications. The combined scoring method identifies 677 non-redundant peptide identifications (from the 2451 total redundant set), an increase of 53%.
Large scale proteome analyses produce significant quantities of data but they are time-consuming and costly to run. Running more technical replicates can lead to larger numbers of identifications, but is not always practical or cost-effective. Furthermore, software has been developed to determine absolute protein quantitation by counting the frequency of peptides identified by mass spectra16,17. Any methods that can greatly increase the number of peptide identifications from a single study are therefore significant, as they can reduce the number of replicates required, and reduce the overall cost and time to run an experiment. It has been previously suggested that by using multiple search engines, a higher proportion of the proteome can be sampled18, but such efforts have been hampered by the lack of consensus on a software independent score.
We have re-searched considerable volumes of data, downloaded from PeptideAtlas, with Mascot, Omssa and X!Tandem. Each of the search engines produces an e-value that can be used within an experiment to score the relative quality of peptide-spectrum matches. However, the e-values have little correlation across different search engines and are not a reliable indicator of identification quality. By using a decoy database search, in each experiment a threshold can be set that ensures the rate of false discovery is sufficiently low. In this work, we introduce the concept of an FDRScore, which reflects the predicted rate of false discovery for an identification made with a particular score, reported by a single search engine on a specific data set. The FDRScore has a similar basis to the statistical measure of a q-value but incorporates simple linear regression to maintain the order of identification quality which is lost in the calculation of q-values. Crucially, FDRScores allow identifications from different search engines to be combined within the same scoring framework. It has previously been demonstrated that due to differences in how search engines score identifications, there are differences in the sets of peptides discovered. By analysing false discovery rates, we are able to demonstrate that if different search engines agree on identifications, the frequency of false positives is low. However, even in the sets of peptides identified by a single search engine, true positives can still be found. The combined FDRScore allows identifications to be extracted from identifications made by one, two or three search engines, maximising the number of peptide identifications while ensuring that the minimum number of false positives pass the threshold.
The benefits of combining multiple search engine results have also been demonstrated by the Scaffold software21. Scaffold does not rely on rates of false discovery but instead works on a related metric: the probability of a peptide being correct or incorrect. Identification probabilities can be calculated for each search engine, and Scaffold calculates a combined probability of correct identification if more than one search engine has identified the same peptide from a spectrum. The relationship between identification probability (or error probabilities) and FDR has been examined recently by Käll et al.22. They argue that error probabilities are more relevant where the presence of a specific peptide or protein is being considered. In large scale proteome scans, setting thresholds by FDR appears to provide a better measure for balancing the trade-off between false positives and sensitivity. In the data presented for Scaffold21, a 33% gain in peptide identifications is reported over the best performing single search engine in an 18 protein sample, and a 14% increase in a more complex sample, at a 1% FDR. On average, our method identifies 35% more peptides than the best single search engine at 1% FDR in complex data sets. It appears that analysing FDR rather than identification probability allows larger gains in sensitivity, and that the Scaffold software could be further improved by incorporating an FDR-based score.
In summary, we present a proposal for a software-independent measure of the quality of an identification, the FDRScore, which can be assigned to all identifications when a decoy database search has been performed. We have utilised the FDRScore to combine data across search engines, and estimated the total number of true positive proteins that could be detected. We have demonstrated how the FDRScore can be re-assessed to reflect the contribution of evidence from different search engines. The combined FDRScore is an effective discriminator between correct and incorrect identifications, allowing considerable gains in the number of peptides that can be identified for a fixed FDR.
Work in Manchester by AJ was funded by a grant from the BBSRC. JAS, SJH, NWP also acknowledge BBSRC funding on the ISPIDER grant, ref BBS/B/17204. The authors thank Julian Selley for assistance with Mascot searches.