7.1 Issues complicating protein level analysis
Several difficulties have been identified that complicate the process of assembling peptides into proteins: (1) non-random grouping of peptides to proteins, resulting in an amplification of error rates going from PSM to unique peptide to protein level [8, 134]; (2) the loss of connectivity between peptides and proteins due to protein digestion, creating the protein inference problem [50].
The first problem is illustrated in . The mapping of correct PSMs to proteins is an
abundance-driven process, reflecting the fact that more abundant proteins are identified by a higher number of unique peptides and PSMs (as a side note, the relationship between the number of PSMs and the protein abundance can be modeled using a Poisson distribution [226]). For example, in a typical shotgun proteome profiling experiment of a fairly complex organism having 20,000 genes (proteins), a typical outcome would be the identification of ~1000 proteins from an order of magnitude higher number of correct PSMs (filtered at a low FDR). Thus, correct PSMs tend to group into a relatively small number of proteins compared to the size of the proteome of the organism. In contrast, incorrect PSMs are due to semi-random matching to any of the entries (20,000 in this example) from the sequence database. The matching is only semi-random because proteins differ in sequence length, and because of the homology problem that will be discussed later. As a result, in a typical experiment almost every high scoring incorrect PSM adds another incorrect protein identification. This has an important implication: even a small FDR at the PSM level can translate into a high FDR at the protein level. This effect becomes more pronounced as the number of MS/MS spectra in the dataset increases relative to the number of identifiable proteins in the sample. It also generally makes it more difficult to identify proteins based on a single peptide, many of which are low abundance proteins.
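The amplification of error rates can be made concrete with a back-of-the-envelope calculation. The sketch below uses the protein and PSM counts from the example above; the 1% PSM-level FDR is an assumed value for "a low FDR", and the one-protein-per-incorrect-PSM mapping is a simplification of the statement that almost every incorrect PSM adds another incorrect protein.

```python
# Counts follow the example in the text; psm_fdr is an assumed value.
n_correct_proteins = 1000
n_correct_psms = 10 * n_correct_proteins  # "an order of magnitude higher"
psm_fdr = 0.01                            # assumption: 1% PSM-level FDR

# Solve incorrect / (correct + incorrect) = psm_fdr for the incorrect count.
n_incorrect_psms = n_correct_psms * psm_fdr / (1.0 - psm_fdr)

# Simplifying assumption: each incorrect PSM maps to a different protein.
n_incorrect_proteins = n_incorrect_psms

protein_fdr = n_incorrect_proteins / (n_correct_proteins + n_incorrect_proteins)

print(f"PSM-level FDR:     {psm_fdr:.3f}")
print(f"Protein-level FDR: {protein_fdr:.3f}")
```

Under these assumptions roughly 100 incorrect PSMs produce roughly 100 incorrect proteins, so a 1% error rate among PSMs becomes an error rate of about 9% among proteins.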
The second problem is related to the presence of shared peptides, i.e. peptides whose sequence is present in more than a single entry in the protein sequence database. Such cases most often result from the presence of homologous proteins, alternative splice variants, or redundant entries in the sequence database, and make it difficult to infer the particular corresponding protein (or proteins) present in the sample [50, 227]. Shared peptides are fairly abundant in the case of higher eukaryotes. As a result, in shotgun proteomics it is often not possible to differentiate between different protein isoforms. A detailed discussion of the difficulties in interpreting the results of shotgun proteomic experiments at the protein level can be found in [50].
7.2 Computing protein probabilities
For the sake of clarity, the discussion below will first ignore the problem of shared peptides. In this case, the main task of the protein-level modeling is to group PSMs into proteins, and then calculate a statistical confidence score for each protein identification (or, at the minimum, to determine the protein-level FDR for a filtered protein list). A simple and commonly used approach is to apply various filters at the peptide level (e.g. filter the list of PSMs using the database search score(s), E-values, or posterior probabilities) so as to achieve a desired protein-level FDR estimated using the target-decoy strategy. The alternative is to perform a more advanced analysis combining the evidence from multiple PSMs corresponding to each protein. In statistical methods, the starting point could be the posterior probabilities of PSMs (computed, e.g., using PeptideProphet) and the peptide to protein mappings. The outcome would then be (a) protein posterior probabilities (or just scores) that allow more efficient filtering of data at the protein level and (b) the knowledge of the FDR corresponding to each protein probability (score) threshold used to filter the data.
The process of computing protein probabilities necessarily involves making various assumptions. A protein can be identified from multiple different peptides. In turn, each peptide can be identified from multiple peptide ions, e.g. from a doubly and a triply charged peptide ion, or in modified (e.g. phosphorylated) and unmodified forms. In addition, each peptide ion may be sequenced multiple times (redundant PSMs). Additional factors affecting the confidence in the protein identification include the length of the protein (or, more precisely, the number of expected tryptic peptides), the total amount of MS/MS data collected, the size of the protein sequence database, and the number and the dynamic range of proteins in the sample. The question of how to combine different sources of evidence in computing protein probabilities and to account for the factors mentioned above is an active area of research.
In combining the evidence from multiple PSMs corresponding to the same protein, one can simply select the best (i.e. highest score/probability) PSM and use its score as the protein score (the “best peptide” model in ). This approach can be further extended to require that a protein is identified by at least a certain number of peptides. For example, in the “two peptide rule” the protein score is essentially computed as 1 if the protein is identified by two or more PSMs with a score above a certain threshold, and 0 otherwise. In doing so, it is typically required that the protein is identified by two different peptides, because redundant PSMs identifying the same peptide sequence cannot be considered as independent events. As an intermediate approach (implemented, e.g., in ProteinProphet), one may count as different peptides the identifications of the unmodified and a modified version of the same peptide, or the identification of a peptide from MS/MS spectra of different charge states.
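The two simple scoring schemes described above can be sketched as follows. This is a minimal illustration, not the implementation of any particular tool: the function names and the `(protein, peptide, score)` tuple layout are hypothetical, and distinct peptides are defined by sequence only (the intermediate approach that also distinguishes modification and charge states is not shown).

```python
from collections import defaultdict

def best_peptide_score(psms):
    """psms: iterable of (protein, peptide, score) tuples.
    Returns {protein: highest PSM score}, i.e. the "best peptide" model."""
    best = {}
    for protein, _peptide, score in psms:
        best[protein] = max(score, best.get(protein, float("-inf")))
    return best

def two_peptide_rule(psms, threshold):
    """Returns {protein: 1 or 0}. A protein scores 1 only if at least two
    DIFFERENT peptide sequences have a PSM scoring above the threshold;
    redundant PSMs of the same sequence are counted once."""
    passing = defaultdict(set)
    proteins = set()
    for protein, peptide, score in psms:
        proteins.add(protein)
        if score > threshold:
            passing[protein].add(peptide)
    return {p: int(len(passing[p]) >= 2) for p in proteins}

# Hypothetical PSMs, for illustration only.
psms = [
    ("A", "PEPTIDEK", 0.95),
    ("A", "PEPTIDEK", 0.90),   # redundant PSM of the same peptide
    ("A", "ANOTHERK", 0.85),
    ("B", "SAMESEQK", 0.90),
    ("B", "SAMESEQK", 0.97),   # two PSMs, but only one distinct peptide
    ("X1", "LONELYR", 0.99),
]

best = best_peptide_score(psms)
rule = two_peptide_rule(psms, threshold=0.8)
print(best)
print(rule)
```

In this toy input, protein X1 receives the highest "best peptide" score but fails the two peptide rule, which is the trade-off between the quality and the quantity of evidence discussed later in this section.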
Several statistical methods for combining the evidence from multiple PSMs in computing the protein probabilities (scores) have been reported as well. One approach is to assume that, in the case of incorrect PSMs, the number of such PSMs mapping to a protein follows a certain parametric distribution (e.g. Poisson), leading to the computation of a protein confidence score similar to the conventional p-value statistics. In doing so, the method incorporates the number of PSMs passing a certain minimum score threshold (but not the confidence in individual identifications), the overall size of the database, and the length of each protein [12, 228]. The protein abundance can also be modeled as a latent variable [155]. Statistical methods based on hierarchical modeling of peptide and protein identification data have also been recently proposed [229-231], and have certain theoretical advantages compared to the existing simpler approaches. These new methods should be further evaluated in future work, provided the software tools implementing these advanced methods become available.
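To illustrate the flavor of the Poisson-based scoring described above, the sketch below computes a p-value for observing at least k PSMs on a protein under a null model in which PSMs land on proteins at random, in proportion to sequence length. The null model, the function names, and the example numbers are assumptions made for illustration; this is not the exact formulation used in the cited methods [12, 228].

```python
import math

def poisson_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), computed as 1 - P(X <= k - 1).
    Adequate for a sketch; very small tails lose precision this way."""
    if k <= 0:
        return 1.0
    cdf = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))
    return max(0.0, 1.0 - cdf)

def protein_p_value(n_psms, protein_length, database_length, n_total_psms):
    """p-value for a protein with n_psms PSMs above the score threshold.
    Assumed null model: each of the n_total_psms PSMs hits this protein
    with probability protein_length / database_length."""
    lam = n_total_psms * protein_length / database_length
    return poisson_tail(n_psms, lam)

# Hypothetical numbers: a 500-residue protein in a 10,000,000-residue
# database, with 10,000 PSMs passing the score threshold.
p_one = protein_p_value(1, 500, 10_000_000, 10_000)
p_three = protein_p_value(3, 500, 10_000_000, 10_000)
print(f"p-value with 1 PSM:  {p_one:.3g}")
print(f"p-value with 3 PSMs: {p_three:.3g}")
```

As in the approach described in the text, only the count of PSMs above the threshold enters the score, so a protein with several mediocre PSMs and a protein with several excellent PSMs receive the same p-value.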
One commonly used approach, exemplified by the computational model of ProteinProphet, is to compute a cumulative score [223, 232, 233]. ProteinProphet takes as input a list of PSMs and their posterior probabilities (the output from PeptideProphet), and computes a probability that a protein is present in the sample by combining the probabilities of its corresponding PSMs. However, using the initial PSM probabilities would result in a significantly overestimated probability for many proteins, most notably those identified by a single peptide. This is a direct consequence of the non-random grouping problem mentioned above (see ). To further illustrate this, assume that all 10 peptides shown in have a posterior probability of 0.8 (i.e. in perfect agreement with the actual FDR of 0.2). At the protein level, these accurate peptide probabilities would translate (using the combined peptide evidence equation shown in ) into a 0.998 probability for protein A, 0.992 for protein B, and 0.8 for proteins X1, X2, and C. These protein-level estimates are clearly not accurate, as they predict that there is fewer than one incorrect protein within the list (0.61 to be precise, FDR = 0.12), whereas the actual number is 2 (FDR = 0.4).
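The numbers in this example can be reproduced with a few lines of code. The sketch assumes that the combined peptide evidence equation has the usual form P = 1 - Π(1 - p_i), and that proteins A and B are supported by four and three peptides respectively; both assumptions are inferred from the probabilities quoted above (0.998 and 0.992) rather than stated in the text. The count of two incorrect proteins is taken from the text.

```python
from math import prod

def protein_probability(peptide_probs):
    """Assumed combined-evidence form: the protein is present unless
    every one of its peptide identifications is incorrect."""
    return 1.0 - prod(1.0 - p for p in peptide_probs)

# Assumed peptide counts per protein (inferred, see lead-in); every
# peptide has a posterior probability of 0.8.
peptides_per_protein = {"A": 4, "B": 3, "X1": 1, "X2": 1, "C": 1}
protein_probs = {
    name: protein_probability([0.8] * n)
    for name, n in peptides_per_protein.items()
}

# Expected number of incorrect proteins implied by the probabilities.
expected_incorrect = sum(1.0 - p for p in protein_probs.values())
estimated_fdr = expected_incorrect / len(protein_probs)
actual_fdr = 2 / len(protein_probs)  # two incorrect proteins, per the text

for name, p in protein_probs.items():
    print(f"{name}: {p:.3f}")
print(f"expected incorrect: {expected_incorrect:.2f}")
print(f"estimated FDR: {estimated_fdr:.2f}, actual FDR: {actual_fdr:.2f}")
```

The estimated FDR is well below the actual one because the three single-peptide proteins inherit the peptide-level probability of 0.8 unchanged, even though they contain all of the errors.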
To address this problem, ProteinProphet implements an adjustment of the initial PSM probabilities (p → p′ in ) to account for the protein grouping information - the number of sibling peptides (NSP). Via this adjustment, the method penalizes (i.e. reduces the probabilities of) peptides corresponding to ‘single hit’ proteins such as proteins X1, X2, and C in , and rewards those corresponding to ‘multi-hit’ proteins (proteins A and B). The appropriate amount of adjustment (reflected in the ratio of the NSP distributions, f0(NSP) and f1(NSP)) depends on the sample complexity, the number of acquired MS/MS spectra, and other factors, and is determined automatically for each dataset via an iterative procedure. In this example, the ideal outcome would be a reduction of the initial probabilities of peptides 2, 8, 9, and 10 (that have no siblings) from 0.8 to ~0.3, resulting in the computed probability of 0.3 for proteins X1, X2, and C, in agreement with the actual protein-level FDR.
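A minimal sketch of such an adjustment is given below, assuming it takes the form of a Bayes-rule reweighting of the initial probability by the NSP distributions among correct (f1) and incorrect (f0) peptide identifications. The function names are hypothetical, and the iterative estimation of f0 and f1 from the data is not shown.

```python
def adjust_probability(p, f1_nsp, f0_nsp):
    """Assumed form of the p -> p' adjustment: Bayes' rule with the
    likelihood of the observed NSP value among correct (f1_nsp) and
    incorrect (f0_nsp) peptide identifications."""
    numerator = p * f1_nsp
    return numerator / (numerator + (1.0 - p) * f0_nsp)

def required_ratio(p, p_adjusted):
    """Ratio f1(NSP) / f0(NSP) that moves p to p_adjusted, obtained by
    comparing the odds before and after the adjustment."""
    return (p_adjusted / (1.0 - p_adjusted)) / (p / (1.0 - p))

# Ratio needed to bring a sibling-free peptide from 0.8 down to 0.3.
ratio = required_ratio(0.8, 0.3)
p_single = adjust_probability(0.8, f1_nsp=ratio, f0_nsp=1.0)

# A ratio above 1 (NSP values typical of correct peptides) rewards.
p_multi = adjust_probability(0.8, f1_nsp=3.0, f0_nsp=1.0)

print(f"f1/f0 ratio for 0.8 -> 0.3: {ratio:.3f}")
print(f"adjusted single-hit probability: {p_single:.2f}")
print(f"adjusted multi-hit probability:  {p_multi:.2f}")
```

Under this assumed form, reducing 0.8 to 0.3 requires that peptides without siblings be roughly ten times more common among incorrect identifications than among correct ones. The value of 3.0 used for the multi-hit case is purely illustrative.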
Application of more stringent filtering criteria to single hit protein identifications (in ProteinProphet, via the penalty described above) is necessary to keep the error rates under control. However, eliminating all single hit proteins from the final protein summary list is in most cases a suboptimal approach given that these proteins represent 20-30% of all correctly identified proteins in a typical shotgun proteomic dataset. Despite applying the penalty, ProteinProphet does not exclude proteins identified by a single peptide when the peptide has a very high posterior probability. In other words, for each protein, the method considers the quality of the supporting evidence (i.e., the peptide probability) in addition to considering the quantity (the number of identified peptides for that protein). While this goes against the commonly used “two peptide rule”, other recent reports also argue in favor of such an approach [234]. To paraphrase the words of Anacharsis (6th century BC) about friendship, “it is better to have one good peptide than many worthless ones”. Furthermore, empirical evidence suggests that this statement is even more true in the case of very large datasets, where filtering the protein identifications using the simple “best peptide” approach is actually more efficient than using the combined peptide evidence approach.
In evaluating the performance of computational methods for computing posterior protein identification probabilities, one should consider two related but distinct metrics: the discriminating power of the probability as a score for separating correct from incorrect protein identifications, and the accuracy of the probabilities (i.e. whether they can be considered as true posterior probabilities or just as scores). The latter question is important for estimating the FDR at the protein level. If the posterior protein identification probability is accurate, then the FDR can be estimated without adding decoy protein sequences to the database via the sum of the posterior probabilities of all identifications passing the threshold [134] (as illustrated in in the case of PSMs). This has a number of advantages, especially in the case of small datasets where simple decoy count-based FDR estimates may not be reliable. For example, it is not possible to reliably estimate the FDR based on decoy counts in the case of experiments profiling samples containing fewer than a few hundred proteins. In those cases where the computed score has no expectation of being a true posterior probability (as in the best peptide approach mentioned above), the FDR can only be estimated with the help of decoy sequences, analogous to the methods used at the PSM level [174, 182].
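The two ways of estimating the protein-level FDR can be contrasted in a short sketch. The probability-based estimate averages the error probabilities (1 - p) of the entries passing the threshold, and is meaningful only if the probabilities are accurate. The decoy-based estimate shown here, decoys divided by targets above the threshold, is one common simple form and is an assumption, since the text does not specify the estimator. Function names and data are hypothetical.

```python
def fdr_from_probabilities(probs, threshold):
    """Estimated FDR among entries with probability >= threshold,
    computed as the mean of (1 - p) over the passing entries."""
    passing = [p for p in probs if p >= threshold]
    if not passing:
        return 0.0
    return sum(1.0 - p for p in passing) / len(passing)

def fdr_from_decoys(entries, threshold):
    """entries: iterable of (score, is_decoy) pairs. Assumed estimator:
    number of decoys divided by number of targets at or above threshold."""
    n_decoy = sum(1 for score, is_decoy in entries
                  if score >= threshold and is_decoy)
    n_target = sum(1 for score, is_decoy in entries
                   if score >= threshold and not is_decoy)
    return n_decoy / n_target if n_target else 0.0

# Hypothetical protein probabilities and scored target/decoy entries.
probs = [0.99, 0.95, 0.90, 0.60, 0.30]
entries = [(9.1, False), (8.7, False), (7.9, False), (7.5, True),
           (7.2, False), (3.0, True)]

fdr_prob = fdr_from_probabilities(probs, threshold=0.9)
fdr_decoy = fdr_from_decoys(entries, threshold=7.0)
print(f"probability-based FDR: {fdr_prob:.3f}")
print(f"decoy-based FDR:       {fdr_decoy:.3f}")
```

The decoy-based estimate can only change in steps of one decoy, so for a list of a hundred proteins a single decoy hit shifts the estimate by a full percentage point. This granularity is the reason decoy counts are unreliable for small datasets.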
As in the case of PSM-level analysis, protein-level analysis may benefit from incorporation of various sources of auxiliary information or data generated in parallel experiments, e.g. predicted peptide detectability [223, 235] and external data such as transcriptomic data, interaction networks, and pathway information [236-238]. In the absence of well-defined benchmark datasets, evaluating the accuracy of data analysis methods becomes difficult. Computed protein probabilities thus should be considered as just one source of information (the best of what one can do computationally), and protein identifications of biological importance but with borderline statistical confidence should be confirmed by independent technical and biological replication of the experiment, or using alternative strategies such as targeted protein identification using SRM [239].
7.3 Protein inference and presentation of the results at the protein level
In the presence of shared peptides (i.e., peptides whose sequence is present in multiple entries in the protein sequence database), the task of computing protein confidence scores becomes more complicated. Even when using simple filtering approaches, a choice has to be made as to what degree one should utilize shared peptides. While considering only non-shared peptides is an overly conservative approach, treating shared and non-shared peptides equally erroneously inflates the number of reported protein identifications.
The grouping of peptides to protein sequences can be done deterministically [182, 198, 240-242], or probabilistically, e.g. by apportioning peptides to proteins with some weights [50, 134, 155] or using graph-transforming algorithms [243]. An alternative approach [244] sidesteps the process of spectral identification, combines overlapping uninterpreted MS/MS spectra into longer chains, and maps these chains to protein sequences directly. With both approaches, combining peptides into proteins is often insufficient for unambiguous identification of the protein form due to a large number of shared peptides. This is particularly true in those cases where the protein sequence database contains many homologous proteins and splice isoforms (e.g. in the analysis of higher eukaryotes), or when the database intentionally includes sequences from multiple organisms [245].
In early studies, some groups were reporting all proteins identified with at least one non-shared peptide, whereas others reported everything or selected one representative protein among isoforms or homologs [12]. Many currently used tools present the results in a more transparent format by creating protein groups. This approach, and the nomenclature for describing various grouping scenarios, is partly based on the parsimony principle, or Occam's razor (“entities must not be multiplied beyond necessity”), which suggests that one should report the smallest number of proteins (protein groups) that can account for all observed peptides [50]. In this approach, protein database entries that are indistinguishable given the sequences of identified peptides are collapsed into a single protein group. Other scenarios include subset proteins, i.e. proteins that share all of their peptides with another protein that is identified by at least one non-shared peptide, and other more complicated cases [50]. Such a nomenclature provides a more consistent and concise format for representing the results of shotgun proteomic experiments (for a simple illustration see ).
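The two simplest grouping scenarios, indistinguishable proteins and subset proteins, can be sketched with set operations. The function name and the input layout are hypothetical. The subset test used here is a simplified proper-subset check, which does not verify that the larger protein's additional peptides are non-shared, and the sketch does not solve the full parsimony (minimal protein list) problem.

```python
from collections import defaultdict

def group_proteins(protein_to_peptides):
    """Collapse proteins with identical identified-peptide sets into
    one group, and flag groups whose peptide set is a proper subset of
    another group's peptide set (simplified 'subset protein' test)."""
    by_peptides = defaultdict(list)
    for protein, peptides in protein_to_peptides.items():
        by_peptides[frozenset(peptides)].append(protein)

    groups = []
    for peptides, members in by_peptides.items():
        # frozenset '<' is the proper-subset comparison.
        is_subset = any(peptides < other for other in by_peptides)
        groups.append({
            "proteins": sorted(members),
            "peptides": sorted(peptides),
            "subset": is_subset,
        })
    return sorted(groups, key=lambda g: g["proteins"])

# Hypothetical identified peptides (single letters stand for sequences).
identified = {
    "P1": {"a", "b", "c"},
    "P2": {"a", "b", "c"},   # indistinguishable from P1
    "P3": {"a", "b"},        # all peptides also explained by P1/P2
    "P4": {"d"},             # supported by one non-shared peptide
}

groups = group_proteins(identified)
for g in groups:
    print(g)
```

In a report built on such groups, P1 and P2 would be listed together as one entry, and P3 would typically be marked as a subset protein rather than counted as an independent identification.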
In certain cases, e.g. for comparison of proteomic and transcriptomic data, or to simplify the visualization of proteomic data in genomic context, it is advantageous to assemble and interpret the data using not a protein but a gene index as a reference. To achieve this, one can map peptides directly to the genome (or utilize protein-to-gene mappings already available for some protein sequence databases), and collapse the protein groups to keep unique gene accession numbers only. More elaborate gene model – protein sequence – protein accession relationships have also been suggested [246]. Interpretation of the results at the gene level has the additional benefit of providing more conservative protein lists by eliminating erroneous identifications of homologous proteins. One common example of this kind would be a minor isoform of a protein reported as unambiguously identified by a single non-shared peptide which is in fact a false identification, and where the true (highly homologous) peptide sequence belongs to another protein isoform of the same gene.
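Collapsing protein groups to a gene index can be sketched as a simple lookup followed by de-duplication. The function name, the accession strings, and the protein-to-gene table below are hypothetical; in practice the mapping would come from the annotation of the sequence database.

```python
def collapse_to_genes(protein_groups, protein_to_gene):
    """Map each protein group to the sorted tuple of genes its members
    belong to, keeping each distinct gene set once, in first-seen order.
    Proteins without a known gene mapping are skipped."""
    seen = set()
    gene_level = []
    for group in protein_groups:
        genes = tuple(sorted({protein_to_gene[p] for p in group
                              if p in protein_to_gene}))
        if genes and genes not in seen:
            seen.add(genes)
            gene_level.append(genes)
    return gene_level

# Hypothetical accessions: three isoforms of gene G1, one product of G2.
protein_to_gene = {"P1a": "G1", "P1b": "G1", "P1c": "G1", "P2": "G2"}
protein_groups = [["P1a", "P1b"], ["P1c"], ["P2"]]

gene_list = collapse_to_genes(protein_groups, protein_to_gene)
print(gene_list)
```

Here the minor isoform P1c, reported as a separate protein group, is absorbed into gene G1, so a possibly false isoform-level identification no longer adds an entry to the final list.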