|Home | About | Journals | Submit | Contact Us | Français|
This manuscript provides a comprehensive review of the peptide and protein identification process using tandem mass spectrometry (MS/MS) data generated in shotgun proteomic experiments. The commonly used methods for assigning peptide sequences to MS/MS spectra are critically discussed and compared, from basic strategies to advanced multi-stage approaches. A particular attention is paid to the problem of false-positive identifications. Existing statistical approaches for assessing the significance of peptide to spectrum matches are surveyed, ranging from single-spectrum approaches such as expectation values to global error rate estimation procedures such as false discovery rates and posterior probabilities. The importance of using auxiliary discriminant information (mass accuracy, peptide separation coordinates, digestion properties, and etc.) is discussed, and advanced computational approaches for joint modeling of multiple sources of information are presented. This review also includes a detailed analysis of the issues affecting the interpretation of data at the protein level, including the amplification of error rates when going from peptide to protein level, and the ambiguities in inferring the identifies of sample proteins in the presence of shared peptides. Commonly used methods for computing protein-level confidence scores are discussed in detail. The review concludes with a discussion of several outstanding computational issues.
More than a decade after the beginning of rapid expansion in proteomic technologies and applications, proteomics remains a fast growing field. Generally defined, proteomics is an integrative study of proteins, and their biological functions and processes. An overarching goal of proteomics is to achieve complete and quantitative analysis of the proteome of a species, including the sub proteomes of various cells or tissue types in the case of multi-cellular organisms. This also includes the reconstruction of protein interaction networks and protein complexes and their dynamic changes, cellular localization analysis, delineation of kinase – substrate relationships, and many other biological applications .
While there exist a number of alternative proteomics strategies (e.g. protein array based methods ), mass spectrometry (MS)-based strategies have become the method of choice for both identification and quantification of proteins in most studies. In this regard, the last several years have been particularly exciting for the field. With the advent of new MS instrumentation, alternative fragmentation mechanisms, and advanced data acquisition strategies, the throughput and the depth of the proteomic analysis have improved by an order of magnitude compared to earlier applications. This has enabled many powerful proteomic applications, including global analysis of post-translational modifications , large-scale reconstruction of protein interaction networks , and deep quantitative proteome profiling of model organisms . Significant efforts are being made to introduce proteomic technologies in clinical and translational research . MS-based proteomics is now increasingly applied in the context of systems biology studies where it is used in parallel with other technologies such as gene expression analysis and metabolomics. MS-based findings are being increasingly annotated in knowledge repositories such as UniProt. MS-specific repositories are also quickly growing, with new resources for various domain applications such as phosphoproteomics being constantly created .
Proteomics, like all high-throughput technologies, is extremely dependent on the ability to quickly and reliably analyze large amounts of experimental data. In the absence of robust statistical and computational methods, proteomic datasets contain significant numbers of false positives [8-13], and statements referring to computational analysis of MS/MS data as e.g. “the Achilles heels of proteomics” are common in the literature . The high rate of false positives in early proteomic publications was so alarming to the scientific community that it lead to the establishment of specific data analysis guidelines by the Editorial Boards of the leading proteomic journals . In recent years, there has been a substantial progress in addressing the most immediate proteomic data analysis needs. Several commercial and open source data analysis pipelines became available and allowed faster and more transparent analysis of proteomic data than previously possible. At the same time, the dramatic change in the size, diversity, and context in which proteomic datasets of today are generated creates a need for a survey and detail discussion of the existing and emerging computational strategies. In this manuscript, we review the process of identifying peptides and proteins from MS data, the resulting data analysis challenges, and the existing computational methods, with a focus on data generated using the shotgun proteomics strategy.
The shotgun proteomics approach is presently the method of choice for identifying proteins in most large-scale studies, with many excellent reviews available describing this technology and its biological applications (e.g. see [16-18]). This strategy involves several major steps (see Figure 1), which are summarized below to provide the necessary background for subsequent discussion of the computational strategies and data analysis issues related to these data.
A key step in shotgun proteomics is digestion of proteins into peptides using proteolytic enzymes such as trypsin (optionally, using multiple different enzymes). In the example of trypsin, the enzyme cleaves peptides after arginine and lysine residues (unless followed by proline), and thus the majority of the resulting peptides are expected to conform to the trypsin cleavage rules on both ends of the sequence (“tryptic peptides”). They should also have no or just a few internal trypsin cleavage sites (“missed cleavages”). Performing protein digestion has many advantages over methods of analysis based on MS/MS sequencing of intact proteins. As a drawback, since each protein digested with trypsin produces multiple peptides (on average about 50), the resulting peptide mixtures can be very complex. Thus, prior to digestion, a protein separation procedure may be employed, e.g., using 1-D SDS-PAGE or organellar based separation, to divide the total protein content of the sample into sub-fractions to reduce the sample complexity. Elimination of extensive protein separation steps prior to MS-based identification, such as two dimensional (2D) gels, allows higher data throughput and protein detection sensitivity. The protein digestion step is often followed by a selective peptide enrichment (depletion) strategy designed to capture peptides having certain specific properties of interest (e.g. N-linked glycosylated peptides, phosphorylated peptides, etc. ). Resulting peptide samples are then further separated using reverse phase chromatography coupled online to the mass spectrometer. Alternatively, peptides can be spotted on a MALDI plate for subsequent analysis using MALDI-source equipped MS instruments. Peptides (or proteins) may also be labeled, chemically or metabolically, with a stable isotope tag to allow quantitative comparison of protein abundances across several samples (e.g. treated cells vs. control) .
Another critical step is peptide sequencing using MS. Peptides, as they elude from the reverse phase column at a particular time (retention time) are ionized, transferred into the gas phase, and selected ions are subjected to tandem mass spectrometry (MS/MS) sequencing to produce fragment ion spectra (MS/MS spectra) . The data acquisition process consists of multiple stages:
(a) The instrument scans all peptide ions that are introduced into the instrument at any given time and records the so-called MS1 spectrum consisting of mass-to-charge ratios (m/z values) and intensities of all peptide ions. (b) Selected peptide ions (‘precursor’ or ‘parent’ ions) observed in the MS1 spectrum are broken down into smaller pieces (fragment ions) in the collision cell of the MS instrument. The acquired MS/MS (or MS2) spectrum is a list of m/z values and intensities of all the fragment ions generated by fragmenting an isolated precursor ion. The fragmentation pattern encoded by the MS/MS spectrum allows identification of the amino acid sequence of the peptide that produced it. The fragmentation is most often based on the process termed collision induced dissociation (CID). However, several alternative mechanisms have recently become available on commercial MS instruments and are used routinely for specialized applications such as sequencing of peptides with PTMs. These fragmentation mechanisms include the electron transfer dissociation (ETD) and Higher energy Collision dissociation (HCD). Furthermore, some instruments can be operated in a multi-stage mode with automated data-dependent triggering of MS3 acquisition or, alternatively, using a technique referred to as Multistage Activation (MSA).
Importantly, the mass accuracy and resolution of the MS analyzer have a significant effect on the information content of the spectrum, which in turn is of great importance for the subsequent peptide identification step. The accuracy with which an MS instrument can measure peptide ion m/z values ranges from as low as several parts per million (ppm) in the case of high mass accuracy instruments such as LTQ-Orbitrap, to more than 500 ppm in the case of low mass accuracy instruments. Even with high mass accuracy instruments, achieving truly high accuracy often requires fine instrument tuning, room temperature control, and use of internal or external (computational) calibration. Similarly, the mass resolution of the instrument governs the ability to accurately determine the charge state of the peptide ion. The ability of the instrument to isolate for MS/MS sequencing parent ions within a narrow window around a particular m/z is also dependent on many factors and is important for minimizing the number of co-fragmented peptides.
After the first step of acquiring a desired amount of MS data, the effort shifts toward the computational analysis. The central element here is the identification of the peptides that gave rise to the measured MS/MS spectra. The peptide identification strategies can be roughly classified into several categories (see Figure 2). Peptide identification can be performed by correlating acquired experimental MS/MS spectra with theoretical spectra predicted for each peptide contained in a protein sequences database (database search approach), or against spectra from a spectral library (spectral library searching). Alternatively, peptide sequences can be extracted directly from the spectra, i.e., without referring to a sequence database for help (de novo sequencing approach). There are also hybrid approaches, such as those based on the extraction of short sequence tags (3-5 residues long) followed by database searching. Finally, a number of iterative or multi-stage strategies have been proposed combining elements of different strategies in a single data analysis pipeline.
Sequence database searching remains the dominant method for assigning peptide sequences to MS/MS spectra, and the number of available computational tools continues to grow (Table 1). The search program takes as input the experimental MS/MS spectrum and compares it against theoretical fragmentation spectra generated for peptides from the searched protein sequence database (see Figure 2). Importantly, the comparison is performed not against all possible peptide sequences, but against a much smaller set of candidate peptides. The candidate peptide list is generated by the program using in silico database digestion and application of several criteria. The most important criteria include the parent ion mass tolerance, enzyme digestion constraint (e.g. allowing tryptic peptides only), and what if any post-translational or chemical modifications are allowed. Additional search parameters (used in the scoring function) include the type of fragment ions expected in the spectrum (e.g. y and b ions in CID), and the fragment ion mass tolerance. For a detailed discussion of the data search parameters see . The output from the program is a list of peptides for each MS/MS spectrum, ranked according to the search score. In most cases, only the best scoring peptide to spectrum match (PSM) for each MS/MS spectrum is considered as the potential peptide identification and is taken to the subsequent statistical data validation step (see section 5).
The search score calculated by the database search program essentially measures the degree of similarity between the experimental MS/MS spectrum and the theoretical spectrum. There is a number of scoring schemes that have been described in the literature and implemented in the currently available search tools. These include the class of spectral correlation functions, from a simple dot product to a more advanced cross correlation function (e.g. SEQUEST, X! Tandem, OMSSA, MASCOT), scoring functions based on empirically observed rules (e.g., SpectrumMill), or statistically derived fragmentation frequencies (e.g., PHENYX). The score that is actually reported by the tool can be on an arbitrary scale (e.g., Xcorr score in SEQUEST), or converted to a statistical measure such as p-value or the expectation value, E-value (see section 5.1 below).
While the rigorous and objective comparison is somewhat hard to perform, empirical evidence indicates that some search tools perform better than others in different settings (e.g. depending on the type of instrument used to generate the data) [22-24]. When applied to the same dataset, the overlap between different search tools is typically in the range of 70-80%, suggesting that application of multiple search tools should increase the overall rate of peptide identification [25-29]. The scoring functions implemented in most tools were optimized for spectra generated using the CID fragmentation mechanism. With the advent of ETD and HCD instrumentation, recent efforts focused on the development of new scoring systems specifically for the analysis of peptide ETD fragmentation data  or phosphorylated peptide data . For example, a substantial increase (up to 80%) was recently demonstrated compared to traditional scoring functions by using an ion type weighting schemes that depend on the precursor peptide ion charge state and the sequence . Even for CID spectra, however, there is a potential improvement that can be achieved via better utilization of fragment ion intensities in the scoring models [33-40].
New search tools are also being developed in support of data independent acquisition schemes , in which multiple precursor ions are purposely fragmented simultaneously (i.e. without isolating a peptide ion species of a specific m/z value as in the conventional data dependent acquisition methods). The database search time remains an important consideration and can be improved via database indexing [42-44] or other algorithmic improvements [45-47], as well as using grid  or cloud  computing.
As mentioned above, the list of candidate peptides is created for each experimental MS/MS spectrum based on the user specified set of search parameters. These parameters essentially reflect the prior knowledge regarding the experiment and can be referred to as auxiliary information useful for separating true from false identifications (also, see section 6.2). In a typical analysis, the precursor ion mass tolerance and the enzyme digestion constraint are the most important. For example, the knowledge of the digestion process allows limiting the search space to only those peptides that conform to the digestion rules specific to the used proteolytic enzyme. This has the benefit of significantly reducing the number of comparisons that need to be made and thus increases the speed of the analysis compared to the enzyme un-constrained or semi-constrained search (i.e. requiring at least one tryptic end). In the case of high mass accuracy instruments one can specify a very narrow mass window (e.g. 5 ppm), compared to a ~ 2 Da window that is commonly used with low mass accuracy data. Performing enzyme-constrained searches, however, has disadvantages. It becomes impossible to identify peptides that exhibit unspecific cleavage, e.g., due to post-translational processing (e.g., removal of the signal peptide), due to contaminating enzymes present in the sample, or because they are products of in-source or in-solution fragmentation of other (tryptic) peptides. Similarly, using too narrow mass window may filter out valid peptide identifications with inaccurately measured peptide mass. Furthermore, instead of severely restricting the list of candidate database peptides, one may instead (given sufficient computational resources) perform a less constraint search and then utilize the auxiliary information as a part of the subsequent, post-database search data validation step (see section 6.4 below).
The choice of the sequence database for MS analysis depends on the goal of the experiment . For many organisms, multiple sequence databases are available. These include the Entrez Protein sequence database from the National Center for Biotechnology Information, its higher quality subset database RefSeq, and UniProt (consisting of Swiss-Prot and its supplement TrEMBL). The International Protein Index (IPI) database - currently a popular database for MS analysis from the European Bioinformatics Institute (EBI) - will no longer be updated beyond 2010. The databases vary in terms of completeness, degree of redundancy, and the quality of sequence annotation. In most cases, using a better annotated database such as UniProt or RefSeq should be sufficient. When the identification of sequence polymorphisms is particularly important, one may attempt to perform searches against a larger database such as Entrez Protein. This database, however, in addition to true biologically relevant sequence variants also contain a large number of redundant sequences derived from GenBank entries representing partial mRNAs and sequencing errors. Searching large databases also reduces the sensitivity of peptide identification by introducing more false identifications (the likelihood of obtaining a high scoring random match increases with increasing database size). Genomic databases also can be used for MS/MS database searching. This is an attractive option when one wants to identify novel peptides not present in any protein sequence database, e.g., novel alternative splice forms [51-54]. Correcting genome annotations or validating gene models predicted based on e.g. expressed sequence tag (EST) data in some cases may by itself be the main goal of the analysis (e.g. it was the initial motivation for building the Peptide Atlas database ). A number of such studies have been recently reported using proteomic data from model organisms as well as from higher eukaryotes [56-65].
Scoring of MS/MS spectra, and the statistical assessment of peptide identification confidence, is also sensitive to the details of the pre-database search spectral processing. As a result, a large number of studies attempted to optimize the spectral processing steps [66-68], cluster redundant spectra [69, 70], recognize and accommodate spectra produced by co-fragmentation of two or more peptides (“chimera” spectra) [71-75], eliminate low quality spectra [51, 66, 76-81], and develop improved charge state determination algorithms [82-85]. A substantial computational effort has been devoted to improved determination of the measured peptide mass [86-88]. Unfortunately, only a few of the proposed approaches described above are currently used in practice due to unavailability of the software or due to difficulties with incorporating the new tools in the existing data analysis pipelines.
Instead of searching acquired MS/MS spectra against theoretically predicted spectra, MS/MS spectra can be assigned peptides by matching against a spectral library [89-92]. The spectral library is compiled from a large collection of experimentally observed MS/MS spectra identified in previous experiments. A newly acquired MS/MS spectrum is compared to library spectra (using a certain mass tolerance window to restrict the set of candidate spectra) to determine the best match . Existing spectral library search tools include SpectraST [55, 94], Bibliospec , and X! Hunter . The National Institute of Standards and Technology (NIST) spectral libraries are available for multiple organisms, and contain data from a variety of MS instrument types. Specialized spectral libraries have also been reported for several types of post-translationally modified peptides such as a ubiquitin and ubiquitin-like spectral library  and a phosphorylated peptide library .
The spectral library matching approach outperforms conventional sequence database searching in terms of speed, error rates, and sensitivity of peptide identification . Another advantage is that statistical models developed for assessing the validity of the peptide identifications by database searching (see section 5) are adaptable to this method . As a drawback, only those peptides can be identified whose spectra were identified previously and entered into the library. Even though the systematic proteome sequencing in the case of some model organisms [97, 98] has already reached a substantial depth and coverage, the existing libraries are still incomplete, especially with respect to peptides from low abundance proteins and peptides containing PTMs. As a potential solution to the problem of incompleteness, the spectra of peptides that are not represented in the spectral library can be predicted using computational methods . Furthermore, methods are being developed for unrestricted spectral library searching which can potentially allow the identification of peptides containing PTMs . As the amount of publicly available data grows (owing to the development of proteomic repositories such as PeptideAtlas, Pride, Peptidome, and Tranche data exchange system , see Table 1), there is a hope that all peptides that are detectable by MS, at least for the most frequently studies organisms, will eventually be discovered and annotated in spectral libraries. At present, however, the spectral library matching tools remain underutilized and used mostly as an additional step in multi-stage strategies (see section 4.2).
The spectral libraries themselves, however, are becoming a rich and very useful resource for many applications. First, they can be used to obtain an improved understanding of peptide fragmentation trends, which in turn can lead to improved database search or de novo scoring functions . They are also being extensively used for the development of targeted MS assays based on the Selective Reaction Monitoring (SRM) approach .
The advantage of the de novo sequencing approach (see Figure 2) over strategies that rely on a sequence database (or spectral library) as a reference is that it allows identification of peptides whose sequence is not present in the searched database (for a recent review on de novo methods see, e.g. ). Several tools are available that can automate this process (see Table 1 for a partial list). Still, de novo sequencing has not yet become a practically useful approach for large scale data analysis because it is computationally intensive and requires high quality MS/MS spectra. In the high throughput environment, and if the organism that is being studied has been sequenced, researchers often do not have the need or the time to follow up on peptides for which there is no exact match in the protein sequence database. As a result, in a typical experiment the computational analysis starts with database searching, and only then, if desired, de novo sequencing tools are applied to interrogate remaining unassigned spectra . However, de novo sequencing is an important approach in the case of organisms with unsequenced or only partially sequenced genomes. In those cases, tools such as MS-BLAST and similar approaches or extensions [104-107] can assist with the downstream analysis of the de novo derived peptide sequences to infer the identities of the sample proteins. De novo sequencing results can also be used simply as an additional source of information for validation of the peptide assignments obtained using database searching . Recent computational developments in the area of de novo sequencing focused on data generated using high mass accuracy instruments [109, 110], on data generated using HCD  and ETD  fragmentation mechanisms, or a combination of several mechanisms (e.g. CID and ETD) [113, 114].
A number of approaches have also been developed that combine the elements of de novo sequencing and database searching. One common approach is to start by extracting, for each acquired MS/MS spectrum, a set of short “sequence tags”  that are likely to be a part of the true peptide sequence. A tag is a short amino acid sequence with a prefix mass and a suffix mass values which designate its position within the peptide sequence. The database search for each MS/MS spectrum is then performed only against those candidate database peptides that contain one of the sequence tags extracted from that spectrum, thus reducing the number of comparisons to be made and the search time. The concept of sequence tag-assisted database searching has been further extended in recent years [116-122]. InsPecT  and TagRecon  are two examples of freely available open-source peptide identification tools that use tags as a filter to conduct the peptide identification (see Table 1). Several improved methods for sequence tag extraction have been recently presented as well [124-126]. As an alternative to using short sequence tags, one can extract longer subsequences using de novo methods to create a “spectral dictionary” [127, 128], or allow gaps in the sequence tags (“gapped peptides”)  which can then be searched against the sequence database.
Hybrid approaches are particularly useful for the identification of post-translationally or chemically modified peptides [123, 130, 131]. Allowing all possible types of modifications at all possible sites leads to a combinatorial explosion of the search space and is therefore poorly compatible with sequence database searching. The use of sequence tags, or related approaches such as look-up peaks, can reduce the size of the search space back to manageable levels.
Despite improvements in MS instrumentation and peptide identification methods, in a typical large-scale dataset a significant number of MS/MS spectra remain “unassigned” (i.e. there is no high confidence peptide assignment to the spectrum) when analyzed using existing tools . Many of these spectra are of high quality, as measured using various spectral features [76, 78, 134]. High quality spectra may remain unidentified in a typical data analysis workflow due to several reasons: constrained database search parameters (e.g. search for tryptic peptides only), inaccurate charge state or mass measurement of the precursor peptide ion, the presence of chemical or post-translational modifications not considered in the search, and incompleteness of the searched protein sequence database [133, 135, 136]. As a result, a number of strategies have been recently proposed for more comprehensive interrogation of MS/MS datasets to increase the number of identified peptides, of which peptides containing PTMs and novel peptides are of most biological interest. These strategies include ‘unrestrictive’ or ‘error-tolerant’ database searching, searching against a combination of proteomic and genomic databases [52, 54], and multi-stage search strategies involving multiple peptide identification tools or iterative application of the same tool.
While conventional strategies (including sequence tag-assisted database searching) allow only certain types of user-defined PTMs, unrestrictive or “blind” algorithms attempt to look for all possible post-translational or chemical modifications. In the extreme case, these tools may allow any mass shift between the mass of the database peptide and the precursor ion mass of the sequenced peptide (including mass shifts not corresponding to any known modification). Such methods are sometimes extended to include “error-tolerant” searches in which the algorithm allows one or more mismatches between the sequence of the peptide that produced the MS/MS spectrum and the database peptide sequence as a way to look for peptides containing sequence polymorphisms. One class of unrestrictive PTM search tools naturally builds on the concept of sequence tags [123, 137-139] reflecting the original (error-tolerant search) motivation behind that strategy . Alternatively, the unrestrictive search can be conducted via spectrum to sequence alignment [140-143], spectral clustering [144, 145], peptide motif analysis [146, 147], or other methods [148, 149].
Several database search tools (e.g. X! Tandem , SpectrumMill, and MASCOT ) allow iterative (multi-pass ) analysis. Furthermore, some tools, e.g. Paragon  - a module of Protein Pilot - implement it as the main strategy . The analysis may start with an enzyme-constrained search and allowing only most common modifications (or no modifications at all), which is then extended to look for peptides with less frequent modifications, nonspecific cleavages, etc. These additional searches may be performed only against the sequences of the proteins that were identified by at least one high scoring peptide in the initial search (“subset database”).
More elaborate multi-stage strategies use a combination of several different computational tools. For example, the analysis may again start with conventional database searching (possibly including multiple database search tools [153, 154]), but then involve the use of spectral library searching [152, 155-157], blind searching for PTMs, and genomic database searching  (see Figure 3 for illustration). Application of such strategies in a sequential manner, in which only high quality spectra that remain unassigned at a particular stage are passed to the next level of the analysis, allows increasing the number of assigned MS/MS spectra without a substantial increase in the computational time .
In addition to simply increasing the number of identified MS/MS spectra, multi-stage strategies described above can assist in obtaining a more complete picture of how the rates of various modifications (post-translational and chemical), as well as the proportions of peptides that are semi-tryptic peptides or contain multiple missed cleavages, vary from sample to sample as a function of the experimental or sample handling conditions. Such an analysis is particularly important in the context of targeted proteomic studies using SRM assays, where accurate peptide quantification requires normalization to account for peptide modifications and changes in the efficiency of trypsin digestion . It is likely that multi-stage strategies will play a prominent role in future proteomic studies. It should be noted, however, that routine application of such strategies, especially in the high throughput environment, requires substantial additional work on the development of statistical error rate estimation methods applicable to such complex strategies.
With an ever increasing size of experimental datasets, proteomic research is increasingly dependent on the automated processing of MS/MS spectra using computational tools described above. While the database search tools produce a match for almost every input MS/MS spectrum, only a fraction of those peptide to spectrum matches (PSMs) are true. In certain datasets, especially those generated using low mass accuracy instrumentation, incorrect PSMs are the majority. The main reasons for such a high failure rate are well known (see above and also [51, 159, 160]). Thus, the development of methods for assessing the confidence of PSMs, and for estimating the error rates resulting from filtering PSM data has become a crucial task (see Figure 4 for illustration). In the remainder of this section, unless otherwise noted, the discussion will focus on the statistical validation of the PSMs in the context of sequence database searching.
The process of assessing the validity of a PSM is not limited to the information contained in the database search score, but also can benefit from the use of auxiliary information (see section 6.2). However, for the sake of clarity, the discussion below will start with a focus on the database search scores as the primary source of information used for distinguishing between true and false identifications. The problem of assessing the confidence of a single PSM is considered first, and then the analysis is extended to the case of large datasets.
When an MS/MS spectrum is searched against the database, the outcome is a list of top scoring candidate database peptides ranked according to the database search score (e.g. hyperscore in X! Tandem, cross-correlation Xcorr score in SEQUEST) measuring the similarity between the acquired and the theoretical spectrum (or between the acquired spectrum and the library spectrum). Typically, only the top scoring candidate peptide – the best match- is considered as a possible correct match (going beyond the best match was recently investigated in ). The score of the best match can be converted into a statistical measure called p-value, or its close relative - the expectation value (E-value)  (see Figure 5 for illustration). For each best matching peptide assignment to an MS/MS spectrum, the null (random) distribution of scores can be estimated by constructing the histogram of scores for all candidate database peptides (excluding the best score) that were scored against that spectrum. One can then reference the score of the best match to the null distribution, and assign a significance measure to the best match. The further away the best match score is located from the core of the null distribution, the higher the statistical significance of the match (i.e., it is more likely to be correct). These steps essentially represent the classical p-value computation as the tail probability in the distribution generated from random matches. To calculate the p-value one either has to assume that the distribution of database search scores for any MS/MS spectrum follows a certain parametric (e.g. Poisson) distribution [163, 164] (with spectrum-specific parameters), to justify a (possibly non-parametric) theoretical derivation for the tail part of the distribution [165, 166], or to perform empirical fitting of the observed distribution of scores [162, 167]. Figure 5 provides a simple illustration of how this type of conversion can be done empirically. E-value is related to p-value but has a more convenient interpretation as the expected number of peptides with scores equal to or better than the observed best match score under the assumption that peptides are matching the acquired MS/MS spectrum by random chance. An alternative approach recently proposed in the literature is based on the concept of generating functions, which essentially considers the top score not relative to the distribution of scores for other (lower scoring) candidate database peptides, but rather with respect to the distribution obtained assuming the universe of all theoretically possible peptides .
The p-values or E-values computed as described above are single-spectrum statistical confidence measures. These scores, unlike the original search scores, are largely invariant under different scoring methods, and thus provide a clearer interpretation of the goodness of the identifications across different instruments, search algorithms, and the search parameters specified in those algorithms. However, the single-spectrum statistical measures are not sufficient when the analysis involves simultaneous processing of multiple MS/MS spectra. In those cases, a multiple testing correction of the individually computed p-values becomes necessary to account for multiple PSMs evaluated simultaneously. Even when requiring a very low p-value indicating high statistical significance for a particular PSM, in the presence of many MS/MS spectra in the dataset there could be many matches with similarly low p-values by random chance alone. At the same time, adjusting the threshold p-value to achieve a specified overall low error rate using classical adjustments such as “Bonferroni correction”  is known to produce overly conservative results (the Bonferroni correction, a representative family-wise error rate measure, was not designed for large size of datasets generated in genomics and proteomics applications). Therefore, additional modeling is required to calculate statistical measures more suitable for filtering of very large collections of PSMs.
In the case of large datasets, the most commonly used and accepted statistical confidence measures are the false discovery rate (FDR) as a summary statistics for the entire collection of PSMs, and the posterior probability or local FDR (denoted here as fdr) for individual PSMs. In the mass spectrometry terminology, FDR is defined as the expected proportion of incorrect PSMs among all accepted PSMs (a global property of a filtered list), whereas the posterior probability of an individual PSM (1 – fdr) is an estimated fraction of correct PSMs among a collection of PSMs having similar search scores (or similar p-values/E-values). The concept of FDR was pioneered by Benjamini and Hochberg, where the initial concept, the so called step-up FDR , was applied to ordered p-values under the assumption that all p-values are statistically independent (this and other assumptions have since been removed or relaxed in many following works). Other methods for controlling FDR include permutation-based methods  and Empirical Bayes [172, 173]. It should also be noted that the global methods (FDR and posterior probabilities of individual PSMs) and single-spectrum based approaches described above are complementary. In other words, global dataset modeling and FDR analysis can be performed for a set of PSMs ranked by individually computed p-values or E-values. It is also worth repeating that the computation of FDR and posterior probabilities for individual PSMs is based on the analysis of global distribution of PSM scores in the entire dataset (multiple MS/MS spectra, and considering the single top scoring PSM per spectrum), whereas the single-spectrum values such as p-values are based on modeling of the single-spectrum distribution (single MS/MS spectrum scored against all candidate database peptides).
In the area of MS/MS-based proteomics, the methods for computing FDR can be broadly grouped into two categories. The discussion here starts with a simple approach based on the use of the target-decoy database search strategy . The target-decoy approach requires minimal distributional assumptions and is easy to implement, which makes it easily applicable in a variety of situations. The strategy requires that experimental MS/MS spectra are searched against a target database of protein sequences appended with the reversed (or randomized, or shuffled) sequences of the same size (see Figure 6 for illustration). Alternatively, the searches against the target and decoy sequences can be performed separately (the differences between these two approaches are discussed below). A similar approach can be used for spectral library searching . The basic assumption is that matches to decoy peptide sequences (decoy PSMs) and false matches to sequences from the target database follow the same distribution. The plausibility of these assumption was discussed in . In the second step, PSMs are filtered using various score cut-offs (e.g. a certain X! Tandem E-value or MASCOT Ion Score cut-off), and the corresponding FDR for each cut-off is estimated as Nd/Nt, where Nt is the number of target PSMs with scores above the cut-off, and Nd is the number of decoy PSMs among them. The underlying assumption is that, given the equal size of the target and decoy database, the number of incorrect target PSMs, Ninc, can be estimated as the number of decoy PSMs. Alternatively, the FDR is sometime estimated as 2Nd/(Nt+Nd).
The accuracy of these FDR estimates depends on the details of the target-decoy database search. There are two commonly used options: two separate searches against the target and the decoy database, and a single search against the concatenated target plus decoy database [176-178]. In the separate target-decoy search strategy, FDR computed as Nd/Nt is a conservative estimate due to the fact that the number of decoy PSMs, Nd, overestimates Ninc when all MS/MS spectra are allowed to match to sequences in the decoy database (including spectra that can be matched correctly to sequences from the target database). This can be corrected by taking into account the proportion of PSMs in the dataset that are incorrect . Note that such a correction was not applied in many studies that used the separate target and decoy database search strategy, see e.g. [180-182]. In practice, a far more common option is to perform the search against the concatenated target-decoy database (see, e.g.  and references therein), which is less sensitive to this problem. Still, the combined strategy may also result in a conservative estimate due to the peptide competition effect: some MS/MS spectra may match to a peptide from the decoy database with a score higher than that the score of the true match. This, in turn, reduces the total number of correct PSMs (a negative characteristics by itself ), and thus increases the FDR. To address this, a modified approach has been suggested that is based on the analysis of the distributions of PSM scores obtained using two separate searches, but correcting for the competition effect .
Decoy sequences can be created using several methods, e.g. via randomization, shuffling, or simple reversal of the target sequences. A number of studies have investigated the differences between various decoy sequence generation methods, and generally found these differences to be insignificant [174, 183, 185]. A more fundamental issue, however, is that none of the existing decoy database creation methods capture all significant sources of false identifications. As previously noted in , incorrect peptide assignments can be considered as coming from two different sources: truly random matches, and incorrect matches to peptides homologous (directly in the sequence domain, or indirectly in the m/z domain) to the true peptides. When creating decoys using any common method, e.g. reversing or randomizing target protein sequences, it should be possible to derive an accurate representation of the distribution of random matches. The second source of the false positives, however, remains underestimated. While the problem may be less severe at the level of PSMs, it may produce a bias in the error rate estimates derived at the protein level. Even at the PSM level, the problem can be quite severe in the case of PTM identifications, or searches performed against large databases such as those generated based on six frame translation of sequences in genomic databases. For example, many high scoring PSMs corresponding to novel peptides resulting from searching against EST databases can be explained as likely false positives, with the true sequences being highly abundant peptides chemically modified during the sample preparation steps. So far, this problem remains underexplored, with only a few attempts to generate decoy databases using more sophisticated rules that preserve some of the sequence homology of the target database .
While the target-decoy strategy allows estimation of global FDR, it does not provide a statistical confidence score for individual PSMs. The analysis of large datasets can be carried out in a more informative fashion when the posterior probability that a particular PSM is correct (which is essentially the compliment of the local FDR, i.e. 1 – local FDR) is estimated and then utilized as the baseline measure to distinguish between correct and incorrect identifications. Posterior probabilities can be computed using a general class of mixture model-based approaches similar to that introduced by Efron et al. . In the context of peptide identification, the mixture model-based error rate analysis was introduced in , and implemented in the computational tool PeptideProphet. Mixture model is a statistical approach that explains the distribution of interest (here, the distribution of database search scores S observed for all PSMs in the dataset) as a mixture of multiple components (here, two: correct and incorrect PSMs), see Figure 7 for illustration. In this approach, the mixing proportion (fraction of all PSMs in the dataset that are correct, π1 in Figure 7) and other parameters governing the distributions of scores for correct and incorrect PSMs (f1(S) and f0(S), respectively) are estimated from the data using, e.g. the expectation maximization (EM) algorithm).
The outcome is a posterior probability computed for each individual PSM, which in turn can be used for probability-based filtering of the entire collection of PSMs (e.g. requiring the posterior probability of PSM P > 0.99). These posterior probabilities can also be used to estimate the FDR. Thus, the dataset can be filtered using the probability threshold that corresponds to a desired FDR. The method is illustrated in Figure 8. Often, PSM data is only minimally filtered and taken as input to the next level of the analysis, in which the probabilities are recomputed and the FDR control is carried out at the protein level (see section 7.2). A note on the choice of terminology: in , the FDR measure was described as the false positive identification error rate because the FDR terminology had not been established at the time yet. The explicit connection between the posterior probability computed in PeptideProphet and the local FDR has been discussed in . The term posterior error probability (PEP) is sometimes used instead of local FDR .
While the original mixture model approach for posterior probability calculation was parametric and unsupervised with respect to the distribution modeling step  (as illustrated in Figure 8), these limitations have since been relaxed. First, the method was extended to incorporate the information from decoy PSMs in the mixture estimation algorithm . The decoys are exploited by allowing their scores to contribute to the estimation of the incorrect PSM distribution only. This way, decoys effectively yield a stable reference distribution of incorrect PSMs, making the posterior probabilities and FDR estimates more accurate and robust. The use of decoys essentially makes the mixture modeling step semi-supervised, in the sense that the class labels are known a priori for some but not all PSMs. The parametric assumptions of the conventional mixture model approach can be relaxed as well. This can be achieved again by utilizing the decoy peptides for non-parametric estimation of the shape of the distribution for incorrect PSMs, simultaneously with the estimation of the parameters of the correct PSM distribution and the mixture proportion using a semi-parametric density estimation method . A more computationally intensive and thus time consuming m(the variable mixture component model ). Non-parametric approaches allows accurate modeling of the distribution of scores for any search engine without the need to select the shapes of the distributions of the mixture components that best fit the data.
These new models [187, 188] have been fully implemented in the current distribution of PeptideProphet , which now can process the results from X! Tandem, MASCOT, OMSSA, Phenyx, ProbID, InsPecT, and MyriMatch, in addition to the originally supported SEQUEST. The use of non-parametric methods for computing posterior probabilities has recently been reported in other tools as well [190-193]. Also of note is that the non-parametric models can be successfully applied to model the results of sequence tag-based methods , and should be applicable to spectral library searching as well. An alternative method to relax the parametric assumptions is to increase the number of (Gaussian) mixture components as in the variable mixture component model . This model, however, has drawback of being more computationally intensive and thus less practical. The number of mixture components can also be increased in a more computationally efficient way and without the use of decoys , although such an approach is likely to be less robust.
Several other classes of statistical methods for computing the probabilities of individual PSMs have been proposed that rely on more elaborate mathematical modeling of the underlying distributions of scores [195, 196]. While these methods may have some advantages over empirical approaches, they are harder to extend to incorporate multiple search scores and/or various auxiliary information. This type of information, as discussed in the next section, can significantly improve the statistical power of the computed confidence scores for separating true from false identifications.
The discussion above focused on the analysis of MS/MS datasets and the resulting PSMs that were sorted, filtered, and whose confidence was quantified using a single database search score. However, this represents a very incomplete picture of the entire peptide identification process.
First, in some scoring schemes (especially those based on cross correlation or shared ion counts), one can compute a number of derivative scores. For example, in addition to the main score of SEQUEST, Xcorr, one can define a class of “delta” scores of which the most commonly used one is ΔCn, the normalized difference between the Xcorr scores of the best and the second best scoring candidate database peptides for a given spectrum [25, 187]. These derivative scores are also useful for discriminating between true and false PSMs. In fact, in certain cases the delta scores may be more discriminative than the absolute scores (see section 6.4 for an additional discussion and illustration using a real dataset). It is also interesting to note that when the search score is converted to a p-value by referring it to the null (random) distribution as described above (single-spectrum statistical scores, Figure 5), the resulting p-value (E-value) reflect both the absolute score and the delta score, with a higher contribution from the latter. Similarly, the Energy-score of the generation function approach  is closely related to the concept of delta scores, albeit it represents the difference between the best database peptide score and the score of the best possible interpretation (i.e., the best score assuming the universe of all possible peptides). The original (absolute) scores and their derivatives are not independent, e.g., a strong correlation is observed between ΔCn and Xcorr. Still, these scores are complementary, i.e. combining both scores often allows improved discrimination between true and false PSMs [25-27, 29, 197-199].
The discrimination between true and false PSMs in large datasets can also be improved by including auxiliary information that may be generated in the course of the experiment (see Figure 9). These include: (1) mass accuracy, ΔM - the difference between the measured and calculated mass of the peptide ion available from the first stage of MS, MS1; (2) peptide separation coordinates such as retention time [199-201] or pI value [11, 202, 203] (peptide separation steps); (3) the number of peptide termini consistent with the type of enzymatic cleavage used (NTT) and the number of missed cleavage sites (NMC) (digestion step); (4) the presence of a specific amino acid or sequence motif, e.g., cysteine in the case of avidin affinity purification of peptides containing biotinylated cysteines, or the sequence motif N-X-S/T for peptides containing N-linked glycosylation sites  (peptide enrichment steps). The most general sources of the auxiliary information (ΔM, NTT, NMC) can be used to restrict the set of candidate database peptides during the database search step. Alternatively, these parameters and other auxiliary information can be used as a part of the post-database search statistical analysis.
It is possible to take into account the auxiliary information even in simple threshold based-filtering approaches [200, 203, 205-207] coupled with the target-decoy FDR estimation strategy. For example, after performing the database search, one may elect to filter out all PSMs with high ΔM values. In the case of pI-base peptide separation prior to MS analysis, one can filter out all peptide identifications having a calculated (based on the peptide sequence) pI value outside of the range of values expected for peptides from a particular sample fraction. However, in such a simple approach dealing with experimental variation (e.g., a bias in the mass measurement, or inaccurate determination of the expected pI value in each peptide fraction) can be problematic. In part, this can be addressed by using iterative mass [86, 208-212] or retention time [209, 213, 214] or pI data calibration steps. Alternatively, auxiliary information can be incorporated in mixture model-based methods and other advanced statistical approaches as described below.
While the lists of PSMs can be processed by applying multiple independent filters for both the search scores (e.g. Xcorr, ΔCn) and various auxiliary parameters, such an approach is suboptimal and requires sufficiently large dataset size due to the need to subdivide PSMs into multiple subcategories. This is where statistical modeling approaches such as PeptideProphet have a particular advantage. By jointly modeling multiple sources of information (the search scores and the auxiliary information such as ΔM, NTT, NMC and etc.) [8, 25, 202], PeptideProphet has an inherent flexibility to detect and correct for a measurement bias (e.g. in mass measurement), and to weigh the contributions of the different types of information in an experiment-specific manner when computing posterior probabilities.
The approach implemented in PeptideProphet represents a combination of supervised and unsupervised modeling. Its supervised part is related to the calculation of the combined score (called “discriminant database search score” or just “discriminant score” in PeptideProphet). When multiple scores are used (e.g. Xcorr, ΔCn in SEQUEST; hyperscore and its derivative delta score in X! Tandem), they are combined using the discriminant function developed based on training data [25, 187]. The rest of the analysis for computing the posterior probabilities and FDR estimation is carried out in an unsupervised fashion in which the discriminant score distributions among correct and incorrect PSMs are learned from each dataset anew using the EM algorithm. To include the auxiliary information, the joint distribution of the search score (or the discriminant database search score) and the auxiliary information is modeled as a multivariate mixture distribution with two components representing correct and incorrect PSMs, respectively (see Figure 10 for illustration). It is assumed that conditional on the identification status, the marginal distributions of the individual variables are independent, which is generally the case with variable used in PeptideProphet. Appending decoys to the sequence database enables semi-supervised and semi-parametric mixture modeling for improved accuracy of posterior probability estimates, as discussed above.
An extension to PeptideProphet that is based on an adaptive approach for computing the coefficients in the discriminant function for combining multiple database scores has being investigated in a recent work . Such an approach removes most of the reliance on the training dataset. The conclusion reached in that study was that the improvement from using the adaptive approach was not significant, except for highly constrained searches (i.e., very narrow mass tolerance, allowing tryptic peptides only). In the case of highly constrained searches the discriminating power of the delta scores (e.g. ΔCn in that study) significantly diminishes reflecting increased variability caused by a significant reduction in the number of candidate database peptides per searched MS/MS spectrum (discussed below in more detail, see section 6.4). As a side note, this observation raises an important question of whether the conventional calculation of the p-values or E-values as described in section 6.1, or even the calculation of the search scores themselves when those scores implement certain elements of probabilistic modeling (e.g. MASCOT's Ion Score) are reliable in the case of such highly constrained searches.
Percolator is another computational tool based on a machine learning approach that utilizes multiple scores and computes posterior error probabilities and FDR (or, more precisely, a closely related measure called q-score) for a set of PSMs. Percolator reduces the dependence on the training data via a dynamic learning approach , in which the original fully supervised approach  based on a support MASCOT  search results. The concept of adaptive (dynamic) training and direct optimization were also recently explored by other groups with a specific focus on phosphorylated peptides [217, 218]. Finally, improved discrimination can be achieved by combining the output from two or more different database search tools [154, 155, 198, 219-223], or by combining data from multiple consecutive stages of mass spectrometry (e.g., MS2 and MS3) .
In peptide identification by database searching, the average number of candidate peptides selected for scoring against each MS/MS spectrum has a significant effect on the downstream statistical analysis and the overall success rate. The question of what search parameters are optimal for deriving the highest number of correct identifications at a fixed FDR is of great practical importance. This question is particularly relevant in the case of high mass accuracy instruments such as LTQ-Orbitrap, where performing searches with very narrow mass tolerance is an attractive option . It has also been debated whether the search should be limited to tryptic peptides only [224, 225]. These questions were recently investigated in , with a somewhat counterintuitive observation regarding the outcome of opening up the search space (i.e. using higher mass tolerance than necessary given the accuracy of the instrument). While it leads to an increase in the number of candidate database peptides, and thus an increased possibility of a false match, with the help of the mass accuracy parameter ΔM at the subsequent data analysis stage there may be a net positive effect.
This is further illustrated in Figure 11, which presents the results of two SEQUEST database searches of the same control protein mixture dataset generated on a high mass accuracy LTQ-FT instrument (see  for details). The searches were carried out allowing tryptic peptides only and with either narrow precursor ion mass tolerance (0.01 Dalton, which translates into ~ 10 ppm for peptides with a singly protonated peptide mass around 1000 Dalton), or with a wide mass tolerance (3 Dalton). The effect of the mass tolerance on the distribution of Xcorr score among correct and incorrect PSMs was rather minimal. By opening up the search space, the distribution of Xcorr scores for incorrect PSMs shifted, as expected, slightly toward higher scores, reflecting the large pool of candidate database peptides available for (random) matching (compare Figure 11a and b).
The effect of mass tolerance on ΔM distributions was obviously far more pronounced than in the case of Xcorr score. The distribution of ΔM scores for correct PSMs is always centered on 0 (correcting for a small mass measurement bias). At the same time, with increasing mass tolerance, ΔM distribution for incorrect PSMs becomes distributed across a wider range of possible values [187, 188]. This in effect increases the proportion of correct vs. incorrect PSMs with ΔM close to 0. The more pronounced are the differences between ΔM distributions observed for correct and incorrect PSMs, the more discriminant ΔM becomes for filtering the data. In fact, Figure 11f (wide mass tolerance search) shows that ΔM in the higher (but not too high) FDR range becomes a more useful score for separating true from false PSMs than Xcorr and ΔCn scores. In the case of narrow mass tolerance search (Figure 11e), ΔM is only marginally useful since high mass accuracy is already utilized as a filter in selecting candidate database peptides.
At the end, PeptideProphet learns the distributions of ΔM values (and other scores) from the data and factors them into computing the posterior probabilities. The more pronounced are the differences between ΔM distributions among correct and incorrect identifications, the more discriminating the computed probabilities become for filtering the data. As a result, when filtering the data using posterior probabilities computed by PeptideProphet, it is possible to identify a larger number of PSMs at the same FDR in the case of wide mass tolerance search compared to the narrow one (Figure 11g). A similar trend is often true for several other peptide properties, most notably the number of tryptic termini NTT.
To supplement the discussion on the role of the delta scores (see section 6.3 above), Figure 11e and 11f also show the discriminating power of ΔCn score. Interestingly, in the case of wide mass tolerance search ΔCn is more discriminative than Xcorr. In the case of narrow mass tolerance search, however, ΔCn becomes far less useful. The figure also plots the results of filtering the data using the posterior probabilities computed by PeptideProphet. By combining the information from all three scores discussed here, Xcorr, ΔM and ΔCn, in the case of wide mass tolerance search the posterior probabilities are significantly better at separating correct from incorrect PSMs than each score on its own (in the most relevant range of FDR below 0.05). The advantage of posterior probabilities compared to Xcorr for filtering the data is less significant in the case of narrow mass tolerance search because, as discussed above, ΔM and ΔCn in this case do not significantly contribute to discrimination in the range of low FDR values.
It is also worth pointing out that opening up the search space effectively creates a decoy peptide population without actually adding decoy sequences to the database . For example, in an enzyme unconstrained search, each spectrum is compared against a set of candidate peptides that includes peptides not likely to be present in the analyzed sample (i.e., non-tryptic and, to a large degree, semi-tryptic peptides). In the case of protein digestion with trypsin under most common conditions (and not considering samples with a high degree of protein degradation such as serum or plasma), less than 1% of all correct PSMs are non-tryptic peptides (NTT=0), and typically less than 10% are semi-tryptic (NTT=1). In contrast, the majority of incorrect PSMs have NTT=0 or 1. Thus, the shape of the distribution of scores for PSMs with NTT=0 or 1 is a good representation of the distribution of scores of all incorrect PSMs. In other words, PSMs with NTT=0 (and most of PSMs with NTT=1) serve as internal decoys (pseudo-decoys). The same logic applies to other auxiliary scores (again, most notably ΔM in the case of wide mass tolerance searches). As a result, PeptideProphet is able to deconvolute the observed distribution of database search scores even if it does not appear bimodal (see Figure 12). This, in turn, explains in part why PeptideProphet is able to derive accurate posterior probabilities even without the use of artificial decoys added to the database, as long as the database search is performed in not an overly constrained manner. Still, adding at least some decoys to the sequence database is desirable in the case of highly constrained searches or challenging datasets as it enables semi-supervised and semi-parametric modeling (see section 6.2 above) for improved robustness of the mixture model approach.
One obvious drawback of performing unconstrained searches that may negate all potential benefits of opening up the search space is the substantial increase in the database search time. Furthermore, the optimal settings for performing the searches are likely to be different for different search tools and post-database search processing options. For example, opening up the search space is generally not recommended in the case of MASCOT, in part due to lower sensitivity of the scoring function implemented in that tool compared to that of SEQUEST. The nature of the analyzed sample and the experimental protocols used to generate MS data are bound to play an important role as well. Thus, future work in this area should include a more detailed analysis of the optimality of the database search condition in different setting.
In most proteomic studies researchers are interested in the list of identified proteins, and the statistical validation of PSMs described above serves as an intermediate step. To derive the protein summary list, PSMs need to be grouped according to their corresponding protein, and the posterior probabilities and FDR need to be recomputed at the protein level. Many database search tools already provide a protein-centric view of the data in addition to the ranked list of PSMs. However, these tools are not adequate for representing the results of large-scale studies. This is due to the fact that in most studies one has to deal with multiple datasets acquired and processed at different times, and because, as discussed above, post-database search statistical modeling allows improved discrimination between correct and incorrect identifications.
Several difficulties have been identified that complicate the process of assembling peptides into proteins: (1) non-random grouping of peptides to proteins, resulting in an amplification of error rates going from PSM to unique peptide to protein level [8, 134]; (2) the loss of connectivity between peptides and proteins due to protein digestion creating the protein inference problem .
The first problem is illustrated in Figure 13. The mapping of correct PSMs to proteins is an abundance-driven process, reflecting the fact that more abundant proteins are identified by a higher number of unique peptides and PSMs (as a side note, the relationship between the number of PSMs and the protein abundance can be model using Poisson distribution ). For example, in a typical shotgun proteome profiling experiment of a fairly complex organism having 20,000 genes (proteins), a typical outcome would be the identification of ~ 1000 proteins from an order of magnitude higher number of correct PSMs (filtered at a low FDR). Thus, correct PSMs tend to group into a relatively small number of proteins compared to the size of the proteome of the organism. In contrast, incorrect PSMs are due to semi-random matching to any of the entries (20,000 in this example) from the sequence database. The non-randomness here comes from the differences between proteins in terms of their sequence length, and due to the homology problem that will be discussed later. As a result, in a typical experiment almost every high scoring incorrect PSM adds another incorrect protein identification. This has an important implication in that even a small FDR at the PSM level can translate into a high FDR at the protein level. This effect becomes more pronounced as the number of MS/MS spectra in the dataset increases relative to the number of identifiable proteins in the sample. It also generally makes the identification of proteins based on a single peptide, many of which are low abundance proteins, more difficult.
The second problem is related to the presence of shared peptides, i.e. peptides whose sequence is present in more than a single entry in the protein sequence database. Such cases most often result from the presence of homologous proteins, alternative splice variants, or redundant entries in the sequence database, and make it difficult to infer the particular corresponding protein (or proteins) present in the sample [50, 227]. Shared peptides are fairly abundant in the case of higher eukaryote organisms. As a result, in shotgun proteomics it is often not possible to differentiate between different protein isoforms. A detailed discussion of the difficulties in interpreting the results of shotgun proteomic experiments at the protein level can be found in .
For the sake of clarity, the discussion below will first ignore the problem of shared peptides. In this case, the main task of the protein-level modeling is to group PSMs into proteins, and then calculate a statistical confidence score for each protein identification (or, at the minimum, to determine the protein-level FDR for a filtered protein list). A simple and commonly used approach is to apply various filters at the peptide level (e.g. filter the list of PSMs using the database search score(s), E-values, or posterior probabilities) as to achieve a desired protein-level FDR estimated using the target-decoy strategy. The alternative is to perform more advanced analysis combining the evidence from multiple PSMs corresponding to each protein. In statistical methods, the starting point could be the posterior probabilities of PSMs (computed, e.g., using PeptideProphet) and the peptide to protein mappings. The outcome then would be (a) protein posterior probabilities (or just scores) that allows more efficient filtering of data at the protein level and (b) the knowledge of FDR corresponding to each protein probability (score) threshold used to filter the data.
The process of computing protein probabilities necessary involves making various assumptions. A protein can be identified from multiple different peptides. In turn, each peptide can be identified from multiple peptide ions, e.g. from a doubly and a triply charged peptide ion, or in a modified (e.g. phosphorylated) and unmodified forms. In addition, each peptide ion may be sequenced multiple times (redundant PSMs). Additional factors affecting the confidence in the protein identification include the length of the protein (or, more precisely, the number of expected tryptic peptides), the total amount of MS/MS data collected, the size of the protein sequence database, and the number and the dynamic range of proteins in the sample. The question of how to combine different sources of evidence in computing protein probabilities and to account for the factors mentioned above is an active area of research.
In combining the evidence from multiple PSMs corresponding to the same protein, one can simply select the best (i.e. highest score/probability) PSM and use its score as the protein score (the “best peptide” model in Figure 14). This approach can be further extended to require that a protein is identified by not less than a certain number of peptides. For example, in the “two peptide rule” the protein score is essentially computed as 1 if the protein is identified by two or more PSMs with a score above a certain threshold, and 0 otherwise. In doing so, it is typically required that the protein is identified by two different peptides, because redundant PSMs identifying the same peptide sequence cannot be considered as independent events. As a intermediate approach (implemented, e.g., in ProteinProphet), one may count as different peptides the identifications of the unmodified and a modified version of the same peptide, or the identification of a peptide from MS/MS spectra of different charge states.
Several statistical methods for combining the evidence from multiple PSMs in computing the protein probabilities (scores) have been reported as well. One approach is to assume that, in the case of incorrect PSMs, the number of such PSMs mapping to a protein follows a certain parametric distribution (e.g. Poisson), leading to the computation of a protein confidence score similar to the conventional p-value statistics. In doing so, the method incorporates the number of PSMs passing a certain minimum score threshold (but not the confidence in individual identifications), the overall size of the database, and the length of each protein [12, 228]. The protein abundance can also be modeled as a latent variable . Statistical methods based hierarchical modeling of peptide and protein identification data have also been recently proposed [229-231], and have certain theoretical advantages compared to the existing simpler approaches. These new methods should be further evaluated in future work, provided the software tools implementing these advanced methods become available.
One commonly used approach, exemplified by the computational model of ProteinProphet, is to compute a cumulative score [223, 232, 233]. ProteinProphet takes as input a list of PSMs and their posterior probabilities (the output from PeptideProphet), and computes a probability that a protein is present in the sample by combining together the probabilities of its corresponding PSMs. However, using the initial PSM probabilities would result in a significantly overestimated probability for many proteins, most notably those identified by a single peptide. This is a direct consequence of the non-random grouping problem mentioned above (see Figure 13). To further illustrate this, assume that all 10 peptides shown in Figure 13 have a posterior probability of 0.8 (i.e. in perfect agreement with the actual FDR of 0.2). At the protein level, these accurate peptide probabilities would translate (using the combined peptide evidence equation shown in Figure 13) into a 0.998 probability for protein A, 0.992 for protein B, and 0.8 for proteins X1, X2, and C. These protein-level estimates are clearly not accurate, as they predict that there is less than one incorrect protein within the list (0.61 to be precise, FDR = 0.12), whereas the actual number is 2 (FDR = 0.4).
To address this problem, ProteinProphet implements an adjustment of the initial PSM probabilities (p → p′ in Figure 13) to account for the protein grouping information - the number of sibling peptides (NSP). Via this adjustment, the method penalizes, i.e. reduces the probabilities of peptides corresponding to ‘single hit’ proteins such as proteins X1, X2, and C in Figure 13, and rewards those corresponding to ‘multi-hit’ proteins (proteins A and B). The appropriate amount of adjustment (reflected in the ratio of the NSP distributions, f0(NSP) and f1(NSP)) depends on the sample complexity, the number of acquired MS/MS spectra, and other factors, and is determined automatically for each dataset via an iterative procedure. In this example, the ideal outcome would be a reduction of the initial probabilities of peptides 2, 8, 9, and 10 (that have no siblings) from 0.8 to ~ 0.3, resulting in the computed probability of 0.3 for proteins X1, X2, and C, in agreement with the actual protein-level FDR.
Application of more stringent filtering criteria to single hit protein identifications (in ProteinProphet, via the penalty described above) is necessary to keep the error rates under control. However, eliminating all single hit proteins from the final protein summary list is in most cases a suboptimal approach given that these proteins represent 20-30% of all correctly identified proteins in a typical shotgun proteomic dataset. Despite applying the penalty, ProteinProphet does not exclude proteins identified by a single peptide when the peptide has very high posterior probability. In other words, for each protein, the method considers the quality of the supporting evidence (i.e., the peptide probability) in addition to considering the quantity (the number of identified peptides for that protein). While this goes against the commonly used “two peptide rule”, other recent reports also argue in favor of such an approach . To paraphrase the words of Anacharsis (6th century BC) about friendship, “it is better to have one good peptide than many of worthless ones”. Furthermore, empirical evidence suggests that this statement is even more true in the case of very large datasets, where filtering the protein identifications using the simple “best peptide” approach is actually more efficient than using the combined peptide evidence approach.
In evaluating the performance of computational methods for computing posterior protein identification probabilities, one should consider two related but distinct metrics: discriminating power of the probability as a score for separating correct from incorrect protein identifications and the accuracy of the probabilities (i.e. whether they can be considered as true posterior probabilities or just as scores). The later question is important for estimating the FDR at the protein level. If the posterior protein identification probability is accurate, then FDR can be estimated without adding decoy protein sequences to the database via the sum of the posterior probabilities of all identifications passing the threshold  (as illustrated in Figure 8 in the case of PSMs). This has a number of advantages, especially in the case of small datasets where simple decoy count-based FDR estimates may not be reliable. For example, it is not possible to reliably estimate FDR based on decoy counts in the case of experiments profiling samples containing less than a few hundred proteins. In those cases where the computed score has no expectation of being a true posterior probability (as in the best peptide approach mentioned above), FDR can only be estimated with a help of decoy sequences analogous to the methods used at the PSM level [174, 182].
As in the case of PSM-level analysis, protein-level analysis may benefit from incorporation of various sources of auxiliary information or data generated in parallel experiments, e.g. predicted peptide detectability [223, 235] and external data such as transcriptomic data, interaction networks, and pathway information [236-238]. In the absence of well defined benchmark datasets, evaluating the accuracy of data analysis methods becomes difficult. Computed protein probabilities thus should be considered as just one source of information (the best of what one can do computationally), and protein identifications of biological importance but with borderline statistical confidence should be confirmed by independent technical and biological replication of the experiment, or using alternative strategies such targeted protein identification using SRM .
In the presence of shared peptides (i.e., peptides whose sequence is present in multiple entries in the protein sequence database), the task of computing protein confidence scores becomes more complicated. Even when using simple filtering approaches, a choice has to be made as to what degree one should utilize shared peptides. While considering only non-shared peptides is an overly conservative approach, treating shared and non-shared peptides equally erroneously inflates the number of reported proteins identifications.
The grouping of peptides to protein sequences can be done deterministically [182, 198, 240-242], or probabilistically, e.g. by apportioning peptides to proteins with some weights [50, 134, 155] or using graph-transforming algorithms . An alternative approach  sidesteps the process of spectral identification, combines overlapping uninterpreted MS/MS spectra into longer chains, and maps these chains to protein sequences directly. With both approaches, combining peptides into proteins is often insufficient for unambiguous identification of the protein form due to a large number of shared peptides. This is particularly true in those cases where the protein sequence database contains many homologous proteins and splice isoforms (e.g. in the analysis of higher eukaryotes), or when the database intentionally includes sequences from multiple organisms .
In early studies, some groups were reporting all proteins identified with at least one non-shared peptide, whereas others reported everything or selected one representative protein among isoforms or homologs . Many currently used tools present the results in a more transparent format by creating protein groups. This approach, and the nomenclature for describing various grouping scenarios, is partly based on the parsimony principle or the Occam's razor – “entities must not be multiplied beyond necessity” - which suggests that one should report the smallest number of proteins (protein groups) that can account for all observed peptides . In this approach, protein database entries that are indistinguishable given the sequences of identified peptides are collapsed into a single protein group. Other scenarios include subset proteins, i.e. proteins that share all of its peptides with another protein that is identified by at least one non-shared peptide, and other more complicated cases . Such a nomenclature provides a more consistent and concise format for representing the results of shotgun proteomic experiments (for a simple illustration see Figure 15).
In certain cases, e.g. for comparison of proteomic and transcriptomic data, or to simplify the visualization of proteomic data in genomic context, it is advantageous to assemble and interpret the data using not a protein but a gene index as a reference. To achieve this, one can map peptides directly to the genome (or utilize protein-to-gene mappings already available for some protein sequence databases), and collapse the protein groups to keep unique gene accession numbers only. More elaborate gene model – protein sequence – protein accession relationships have also been suggested . Interpretation of the results at the gene level has an additional benefit of providing more conservative protein lists by eliminating erroneous identifications of homologous proteins. One common example of this kind would be a minor isoform of a protein reported as unambiguously identified by a single non-shared peptide which is in fact a false identification, and where the true (highly homologous) peptide sequence belongs to another protein isoforms of the same gene.
Assessment of error rates in very large proteomic datasets remains an important issue. Given the fast pace of proteomic technology development, the definition of the term “large-scale” is frequently revised. What used to be a large dataset just a few years ago nowadays can be generated on a single instrument in a day. In addition to the dramatic increase in the speed of data acquisition, many datasets are now collected in multiple replicates. Many applications are concerned with generation of highly overlapping datasets, e.g. label-free quantitative analysis across multiple samples , or comprehensive characterization of the proteomes of model organisms such as Drosophila  or C. elegans . This has a profound effect on many aspects of proteomic data analysis, most notably on protein-level filtering and error rate estimation. While the methods and approaches for global and local FDR estimation at the peptide level are now well understood, there has been much less development in the area of protein level analysis. While ProteinProphet  provides accurate probability estimates for data from typical shotgun proteomic experiments, our recent evaluation of its performance using large multi-replicate datasets indicated that the approach does not always scale well with the dataset size, in line with other recent reports . More advanced, recently proposed methods cannot be easily implemented, whereas methods that are simple enough to implement perform poorly due oversimplifying assumptions (e.g. the Poisson model ). Thus, besides the simple target-decoy approach, the field is essentially lacking a robust and statistically powerful model with a practical software implementation for the analysis of very large proteomic datasets. Several efforts to address this problem are currently under way, including the development of a new computational tool iProphet (manuscript in preparation).
Searching genomic databases is becoming an increasingly popular approach, but should be practiced with great caution. Accurate translation of the DNA sequences into protein sequences is complicated due to frame-shifts, incorrectly predicted open reading frames, and other factors. Performing an accurate FDR assessment becomes difficult due to the need to distinguish between the identification of a potentially novel peptide and, e.g., a chemically modified version if a known, high abundance peptide (homologous to the novel sequence). Furthermore, the FDR estimation methods should take into account a much lower prior likelihood of observing a novel peptide compared to a known peptide (same statement applies to the problem of identifying rare PTMs). To our knowledge, FDR assessment in many published studies focusing on genomic database searching did not go beyond the standard procedures, and thus the true error rate for datasets reported in those studies may not be known. Significant concerns also remain regarding the application of multi-stage peptide and protein identification methods due to the violation of many common assumption about the rates of false identifications in those strategies [152, 251].
Global analysis of PTMs is one area of research that has become possible with recent improvements in MS instrumentation and biochemical methods. There is a tremendous interest in a variety of biologically significant PTMs, including phosphorylation (cell signaling), acetylation, and methylation (epigenetics) . Yet PTM analysis presents many challenges [252, 253]. Using phosphorylation as an exemplary PTM, not only phosphorylated peptides are typically present in the cell at substoichiometric amounts, but their MS/MS fragmentation is often dominated by neutral losses of the phosphate group making the spectra uninformative [254-256]. As a result, sensitive identification of phosphopeptides requires application of complementary data acquisition techniques, e.g. MS3, multi-stage activation (MSA) [257-259], and/or electron transfer dissociation (ETD) . While there has been significant progress in the development of the experimental platforms, computational algorithms for PTM analysis require substantial improvements. In particular, statistical validation of peptides with PTMs is of great concern. Another challenge is localizing the site of the PTM. The existing peptide identification tools were neither optimized for PTM site localization, nor do they provide any confidence score for the assigned site. While several site localization algorithms were previously described [261-263], they implement simplified approaches and have not been rigorously tested. As a result, phosphopeptide identifications often need to be manually verified – an obviously time consuming and subjective process. As well summarized in , “false-positive identifications are particularly dangerous for biologists interested in studying the function of these selected phosphorylation sites, as each phosphorylation site may take one to two years to fully investigate”. The problems related to PTM analysis are also acknowledged in the data publication guidelines from the journal Molecular and Cellular Proteomics .
It can be expected that improvement in tools and methods for shotgun proteomic data analysis will further increase the quality of published proteomic data. Still, such issues as the large dynamic range of protein abundances within the cell  and the limited ability to differentiate between protein isoforms cannot be easily overcome. Thus, it is possible that new significant advances will come from the development of alternative proteomic workflows. One possibility comes in the form of integration of shotgun and intact protein sequencing, which has a potential to provide improved resolution of populations of proteins into its components (including splice isoforms and post-translationally modified forms) [266-269]. Another promising alternative involves targeted analysis of specific peptides of high information content (proteotypic peptides) using SRM or related methods. These and other emerging strategies, however, present their own data analysis issues. Thus, there is every indication to expect that the field of computational proteomics in general, and the development of new algorithms and data analysis tools in particular, will remain an active area of research for years to come.
This work was supported in part by NIH/NCI Grant R01 CA126239.
Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.