When analyzing proteins in complex samples using tandem mass spectrometry of peptides generated by proteolysis, the inference of proteins can be ambiguous, even with well-validated peptides. Unresolved questions include whether to show all possible proteins vs a minimal list, what to do when proteins are inferred ambiguously, and how to quantify peptides that bridge multiple proteins, each with distinguishing evidence. Here we describe IsoformResolver, a peptide-centric protein inference algorithm that clusters proteins in two ways, one based on peptides experimentally identified from MS/MS spectra, and the other based on peptides derived from an in silico digest of the protein database. MS/MS-derived protein groups report minimal list proteins in the context of all possible proteins, without redundantly listing peptides. In silico-derived protein groups pull together functionally related proteins, providing stable identifiers. The peptide-centric grouping strategy used by IsoformResolver allows proteins to be displayed together when they share peptides in common, providing a comprehensive yet concise way to organize protein profiles. It also summarizes information on spectral counts and is especially useful for comparing results from multiple LC–MS/MS experiments. Finally, we examine the relatedness of proteins within IsoformResolver groups and compare its performance to other protein inference software.
An effective method for identifying proteins within complex samples involves multidimensional LC–MS/MS, where proteins are proteolyzed, and peptides are separated by reverse-phase liquid chromatography (RP-LC) and sequenced by mass spectrometry gas phase fragmentation (MS/MS). Automated computer programs are used to analyze the tens of thousands of spectra that can be generated by a single experiment, by matching MS/MS spectra to peptide sequences in protein databases. A significant problem is how to assemble the information contained in large numbers of peptide sequences into a final set of identified proteins.
The task of protein identification is straightforward when peptide sequences are found only within single protein database entries (which we will refer to throughout as “proteins”). However, when a peptide sequence is found in multiple entries, ambiguities arise about which proteins are truly present. This problem is greatest with proteomes where paralogous genes and extensive alternative splicing produce many related proteins within a database.(1) For example, the estimated 20488 distinct genes in the human genome(2) yield 89486 proteins in the International Protein Index (v3.75, Aug. 2010) database,(3) which include splice variants, proteolytically processed proteins, and protein fragments. Our analysis shows that of the 3.8 million fully tryptic peptides from this protein database (allowing ≥8 amino acids and up to 2 missed cleavages), over 2 million are shared between two or more proteins. The prevalence of shared peptides creates a need for computational algorithms that infer the most likely protein assignments, a process called protein inference.(4)
Often protein profiles do not report all possible proteins, but only the minimal list which best accounts for the observed peptides (Table 1). The manner in which minimal list proteins are selected differs between protein inference programs. DTASelect identifies proteins using a greedy algorithm,(5) and in ambiguous cases, shows all possible proteins, allowing users to manually decide between them. ProteinProphet ranks proteins according to probabilities computed from the number of peptides, confidence in the peptide sequence, and the degree to which peptides are shared between multiple proteins.(6) Proteins that are “indistinguishable” (i.e., represented by a set of identical peptides) are assigned equal probabilities. DBParser also uses a greedy algorithm to rank proteins according to those with the most peptides.(7) Phenyx selects a minimal list of proteins, ranked by the number of peptides identified and the protein sequence coverages,(8) but differs from other programs by reporting only one protein entry and accession number (a representative “anchor” protein), even when two or more proteins are indistinguishable. All of these programs use a “protein-centric” approach of matching peptides directly to protein database entries and reporting peptides within the context of proteins (Figure 1a).
In 2004, we proposed an alternative strategy for protein inference, named IsoformResolver, which generates a list of nonredundant peptide sequences, and then matches each peptide to all protein entries which contain that sequence.(9) Thus, the approach is “peptide-centric” because the observed peptides are directly referenced against a peptide database (Figure 1b). This strategy has the advantage of more readily assessing the ambiguity in matching peptides to proteins that share peptide sequences in common. Peptides are output within the context of all possible proteins from which they can derive.
In this study, we describe the IsoformResolver algorithm in detail for the first time, and demonstrate the advantages of using peptide-centric protein grouping methods to address problems in protein inference for large data sets. We demonstrate that protein inference increases the variability of proteins between similar data sets (“volatility”), and show that protein inference methods yield significant volatility when reporting proteins separately, which is solved by peptide-centric protein grouping. A compare profile feature of IsoformResolver allows results from many protein profiling experiments to be analyzed, by first performing inference across all experiments pooled together and then reporting spectral counts from individual experiments in an easily viewed format. Finally, we compare IsoformResolver against other protein inference programs and show that the most important factor influencing agreement between different programs is how they treat indistinguishable proteins. Advantages of IsoformResolver are: (i) its protein grouping methods, which allow concise display of proteins including all possible candidates, (ii) its ability to display related proteins adjacently in a protein profile and compare proteomics data sets analyzed at different times and using different software, (iii) its facile integration of label-free quantification by spectral counting into protein sets, and (iv) its ability to compare results from multiple large-scale data sets.
LC–MS/MS data sets used in these studies were collected on human melanoma and erythroleukemia cell lines and are summarized in Suppl. Table S1 (Supporting Information). Samples were proteolyzed with trypsin as described(9−11) and fractionated by reversed-phase HPLC coupled to an LTQ/Orbitrap mass spectrometer (parent scan 475–1600 m/z). DTA files representing MS/MS spectra were generated using BioWorks XCalibur v.3.0 software and concatenated into MGF files using in-house software. DTA files were searched by Sequest(12) specifying carbamidomethylated cysteine and up to two missed trypsin cleavages. Parent ion tolerance was set to 1.2 Da or 50 ppm (specified in Suppl. Table S1) and fragment ion tolerance to 0.8 Da. MGF files were searched using Mascot v.2.2 (Matrix Science,(13)) using the same parameters, and Mascot results were parsed using the Mascot parser (http://www.matrixscience.com/msparser.html). Decoy versions of databases were constructed by reversing each protein sequence from normal databases, which were then searched separately or as a target-decoy database.(14,15) Peptides were accepted when scores were above thresholds corresponding to a 1% false discovery rate (FDR = FP/(FP + TP)). Peptides were also filtered for physicochemical properties, including peptide size, likely missed cleavages,(16) and mass accuracy (observed minus predicted between −5 ppm and +10 ppm). Peptides were also supported by similarity scoring between observed MS/MS and spectra simulated from peptide fragmentation models(17,18) implemented by Manual Analysis Emulator (MAE).(19)
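The decoy construction and FDR threshold described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the `DECOY_` prefix, the function names, and the choice to count decoy hits as FP and target hits as TP are assumptions made only for illustration, and the scores are made-up values.

```python
def reverse_decoys(proteins):
    """Build a decoy database by reversing each target protein sequence."""
    # The "DECOY_" prefix is a hypothetical naming convention.
    return {"DECOY_" + acc: seq[::-1] for acc, seq in proteins.items()}


def fdr(fp, tp):
    """FDR = FP / (FP + TP), the definition given in the text."""
    return fp / (fp + tp) if (fp + tp) else 0.0


def score_threshold(psms, max_fdr=0.01):
    """Return the lowest score whose accepted set still meets max_fdr.

    psms is an iterable of (score, is_decoy) pairs. Counting decoy hits as FP
    and target hits as TP is a simplifying assumption of this sketch.
    """
    fp = tp = 0
    threshold = None
    for score, is_decoy in sorted(psms, key=lambda p: p[0], reverse=True):
        if is_decoy:
            fp += 1
        else:
            tp += 1
        if fdr(fp, tp) <= max_fdr:
            threshold = score  # everything scoring >= this value is accepted
    return threshold


# Toy data: one target sequence and four scored peptide-spectrum matches.
targets = {"P1": "MKTAYIAK"}
decoys = reverse_decoys(targets)
psms = [(10.0, False), (9.0, False), (8.0, True), (7.0, False)]
cutoff = score_threshold(psms, max_fdr=0.30)
```

With a permissive 30% limit the cutoff falls below the single decoy hit; with the 1% limit used in the text, only the matches scoring above the decoy would be accepted.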
IsoformResolver is a Perl program that uses as input one or more files containing validated peptide spectrum matches and generates a protein profile displaying all identified and inferred proteins (Figure 2). For protein information, IsoformResolver accepts any FASTA or EMBL DAT formatted protein databases. Prior to IsoformResolver execution, these protein databases are reformatted into a peptide-centric database, consisting of map files that associate peptides with proteins from which they can be proteolytically derived. This is done once per protein database and requires specifying a protease, number of allowable missed cleavages, and a minimum peptide length. During IsoformResolver execution, validated peptide spectrum matches are input, using the file format shown in Suppl. Figure S1 (Supporting Information). Peptides not found in the peptide-centric database, such as semiproteolytic and nonenzymatic peptides, are searched for within the protein-centric database, and matched to the proteins from which they derive and to the MSD and ISD protein groups to which the proteins belong. Peptides, even semi- and nonproteolytic, are included in all sections of IsoformResolver output and included in spectral counting. Peptide-centric database files have been constructed and tested for use with many proteases including ArgC, LysC, Trypsin, AspN, and can be constructed for any protease with cleavage specificity. In addition, we have constructed and tested peptide-centric database files with combined ArgC + LysC + trypsin cleavages. ISD reformatted datafiles can be constructed from any protein database. The impact of the peptide-centric database will be higher as the number of shared peptides increases. Thus, while ISD protein groups show some benefit using UniProt Sprot, which has a relatively low number of shared peptides, the impact is higher using Sprot/Trembl/Splice variants, a database with an even greater percentage of shared peptides than IPI.
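As a rough illustration of how a peptide-centric database could be built, the following Python sketch digests each protein in silico and records which proteins contain each peptide. IsoformResolver itself is written in Perl and its map-file format is not reproduced here; the function names, the toy sequences, and the cleavage rule (after K or R, but not before P) are assumptions of this sketch.

```python
from collections import defaultdict


def cleave(sequence):
    """Split a sequence after K or R unless the next residue is P (assumed trypsin rule)."""
    fragments, start = [], 0
    for i, residue in enumerate(sequence):
        following = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and following != "P":
            fragments.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        fragments.append(sequence[start:])  # C-terminal fragment
    return fragments


def digest(sequence, missed_cleavages=2, min_length=8):
    """Return all peptides with up to `missed_cleavages` internal cleavage sites."""
    fragments = cleave(sequence)
    peptides = set()
    for i in range(len(fragments)):
        last = min(i + missed_cleavages, len(fragments) - 1)
        for j in range(i, last + 1):
            peptide = "".join(fragments[i:j + 1])
            if len(peptide) >= min_length:
                peptides.add(peptide)
    return peptides


def peptide_to_proteins(database, **digest_options):
    """Map each in silico peptide to the set of proteins that contain it."""
    mapping = defaultdict(set)
    for accession, sequence in database.items():
        for peptide in digest(sequence, **digest_options):
            mapping[peptide].add(accession)
    return dict(mapping)


# Toy database; short limits are used so the example stays readable.
database = {"A": "AAAKCCCRDDDK", "B": "CCCRDDDKEEE"}
index = peptide_to_proteins(database, missed_cleavages=0, min_length=4)
shared = {pep for pep, proteins in index.items() if len(proteins) > 1}
```

Because the mapping depends only on the database and the digest parameters, it can be computed once and reused for every data set, which matches the "once per protein database" step described above.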
IsoformResolver utilizes two types of protein groups: in silico-derived (ISD) protein groups and MS/MS-derived (MSD) protein groups. ISD groups are constructed using all peptides derived from in silico proteolysis of a protein database. Using the peptide-to-protein mapping from the peptide-centric database, proteins are then clustered together whenever they have a peptide in common. Resultant ISD groups are assigned group identifiers, and the mapping of proteins to these identifiers is stored in a text file for rapid access during IsoformResolver execution. MSD protein groups are constructed in an identical way, but using a different set of input peptides, consisting of sequences identified from the MS/MS and validated by thresholds or other means. The list of all possible proteins for the observed peptides is obtained by matching peptides to the precalculated peptide-to-protein mapping from the reformatted protein database. These proteins are clustered whenever they have an observed peptide in common, and the resultant protein groups are then assigned an MSD group identifier. MSD groups thus contain only peptides and proteins which were observed in the MS/MS experiment, while ISD groups contain peptides and proteins from the entire protein database, even when they were not observed.
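The clustering step can be sketched in Python as a connected-components computation over the peptide-to-protein mapping. This is an illustrative reading of "clustered together whenever they have a peptide in common" (taken here to be transitive), not the authors' Perl implementation; the function name and the toy peptides are hypothetical.

```python
from collections import defaultdict


def group_proteins(peptide_to_proteins):
    """Cluster proteins that are linked, directly or transitively, by shared peptides."""
    parent = {}

    def find(protein):
        parent.setdefault(protein, protein)
        while parent[protein] != protein:
            parent[protein] = parent[parent[protein]]  # path halving
            protein = parent[protein]
        return protein

    for proteins in peptide_to_proteins.values():
        members = sorted(proteins)
        if not members:
            continue
        root = find(members[0])
        for other in members[1:]:
            parent[find(other)] = root  # merge the two clusters

    groups = defaultdict(set)
    for protein in list(parent):
        groups[find(protein)].add(protein)
    return sorted(sorted(group) for group in groups.values())


# In silico mapping for a toy database: p1..p3 chain four proteins together.
in_silico = {"p1": {"A", "B"}, "p2": {"B", "C"}, "p3": {"C", "D"}}
# Only p1 and p3 were observed by MS/MS in this toy experiment.
observed = {"p1", "p3"}

isd_groups = group_proteins(in_silico)
msd_groups = group_proteins({p: in_silico[p] for p in observed if p in in_silico})
```

The same function produces both kinds of group; only the peptide set changes. In this toy case the unobserved peptide p2 is the only link between the two MSD groups, so they merge into a single ISD group.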
Protein inference is performed on each MSD protein group separately, considering each peptide equally plausible by default, although IsoformResolver can also accept peptide weights using scores or probabilities. Proteins are designated as primary through an iterative process, in which a greedy algorithm first selects the protein which accounts for the largest number of peptides within an MSD group (or the highest combined score or probability), then the protein which accounts for the largest number of remaining peptides that do not match the first protein, and so on until no peptides remain. All other proteins (which lack distinguishing peptide evidence) are designated as secondary. Indistinguishable proteins are primary proteins which are identified by shared peptides that cannot distinguish between the proteins; they are counted as a single protein in the minimal list, although all protein identifiers are reported.
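A minimal Python sketch of the greedy selection within one MSD group follows. It uses unweighted peptide counts only, breaks ties by accession order, and treats proteins with exactly the same observed peptides as a selected protein as indistinguishable from it; the tie-breaking rule and all names are assumptions of the sketch rather than documented IsoformResolver behavior.

```python
def infer_proteins(peptide_to_proteins):
    """Greedy inference for one MSD group.

    Returns (primary, indistinguishable, secondary), where `indistinguishable`
    maps each selected primary protein to the proteins sharing its exact evidence.
    """
    # Invert the mapping: protein -> set of observed peptides it explains.
    evidence = {}
    for peptide, proteins in peptide_to_proteins.items():
        for protein in proteins:
            evidence.setdefault(protein, set()).add(peptide)

    remaining = {pep for pep, proteins in peptide_to_proteins.items() if proteins}
    primary = []
    while remaining:
        # Pick the protein explaining the most unexplained peptides;
        # sorted() makes ties resolve to the lowest accession (an assumption).
        best = max(sorted(evidence), key=lambda prot: len(evidence[prot] & remaining))
        primary.append(best)
        remaining -= evidence[best]

    indistinguishable = {
        p: sorted(q for q in evidence if q != p and evidence[q] == evidence[p])
        for p in primary
    }
    tied = {q for others in indistinguishable.values() for q in others}
    secondary = sorted(set(evidence) - set(primary) - tied)
    return primary, indistinguishable, secondary


# Toy MSD group 1: A and B each have a distinguishing peptide; E has none.
group_1 = {"pep_a": {"A"}, "pep_b": {"A", "B", "E"}, "pep_c": {"A", "B"}, "pep_d": {"B"}}
# Toy MSD group 2: C and D are supported by the same single peptide.
group_2 = {"pep_x": {"C", "D"}}

result_1 = infer_proteins(group_1)
result_2 = infer_proteins(group_2)
```

In group 1, A and B are both primary and E is secondary. In group 2, C is selected and D is reported as indistinguishable from it, so the pair counts once in the minimal list.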
In addition to the mapping files described above, the peptide-centric database includes an annotation file which contains information on the relatedness of proteins within each ISD group. Functional relatedness is evaluated: (i) by gene annotation, based on genes (from Entrez Gene, HGNC, Ensembl, VEGA, or H-InvDB), gene clusters (UniGene) or gene location (chromosomal start location and sense/antisense direction), (ii) by protein family, based on InterPro, Pfam, PROSITE, GENE3D, SUPERFAMILY, PANTHER, ProDOM, PRINTS, and TIGRFAMs databases, and (iii) by GO and other annotations found in the DAT format (e.g., RZPD, UTRdb, SMART, CCDS, CleanEx). Each ISD group has a unique identifier, and is annotated to indicate the percentage of proteins in the group with the same gene, protein family, GO, or other annotation.
Comparisons of IsoformResolver to five other protein inference programs used the following versions of software. Analyses with ProteinProphet(6) used Transproteomic Pipeline (TPP) v.3.3.0 (9/25/2007), and v4.3 JETSTREAM rev 0, Build 200908071234 (MinGW) (http://tools.proteomecenter.org/TPP.php), and were performed using the Mascot option, with peptide probability cutoff 0.95 and protein probability cutoff 0.50. Analyses with Scaffold v.01_07_00 (described in (20) and generously provided by Proteome Software) used the combined Mascot and Sequest option, with peptide and protein probability cutoffs of 0.95 and 0.50, respectively. Analysis with Panoramics v.1 (05/2007, described in ref (21)), used the Windows executable provided by the USDA Agricultural Research Service, performed on Mascot search results using protein probability threshold 0.80. IDPicker v.2.0 (described in refs (22) and (23), http://fenchurch.mc.vanderbilt.edu/lab/software.php) used peptide and protein probability cutoffs equal to 0.99. The same Sequest and Mascot results files were used in all analyses, except for IDPicker where data sets were searched using a combined target/decoy database. Analyses with Phenyx Public Server and PhenyxOnline v.2.5 (described in ref (24) and generously made accessible by GeneBio) used the default threshold cutoff (Z-score = 5, p = 0.0001, and AC score = 6).
To compare output between programs, peptides from each program were converted into a common input format, a compare protein profile was created from all peptides generated by the six programs, and the output was annotated with proteins identified by each program. Using IsoformResolver MSD and ISD protein groups, related proteins from each of the profiles were clustered together, simplifying the evaluation in cases where proteins were missed by a profiler or protein variants were identified and allowing for an easy enumeration of primary and secondary proteins.
IsoformResolver precalculates a mapping of all proteins to a list of nonredundant peptides within a given database (Figure 2), which identifies all proteins that share peptide sequences. It then generates a protein profile displaying all identified and inferred proteins from one or more files of observed peptides. The peptide-centric algorithm allows two types of protein groups to be generated. In silico-derived (ISD) protein groups are constructed from a protein database, by compiling all peptides derived from in silico proteolysis (Figure 3a). MS/MS-derived (MSD) protein groups are constructed in an identical way but using input peptides identified experimentally from MS/MS data sets (Figure 3b). Proteins are then assigned to the same group whenever they have a peptide in common. For example, in Figure 3a, proteins_A, _B, _C and _D share peptides and are therefore within the same ISD group. However, proteins_A and _B and proteins_C and _D belong to two MSD groups because not all peptides shared between these proteins are observed. Because only some of all possible peptides can be detected by MS/MS, MSD protein groups are strict subsets of ISD protein groups.
IsoformResolver creates a comma separated values output file which consists of three sections (Figure 4, detailed output in Suppl. Figure S2, Supporting Information). Section 1 displays proteins and peptides within MSD groups, which are in turn listed together within ISD groups. The output catalogues two types of proteins: those that pass Occam’s razor test of being among the smallest number that account for the peptide evidence (“primary” proteins), and those that do not (“secondary” proteins). Thus, proteins that account for the greatest number of peptides within an MSD group, or else have distinguishing peptide evidence, are primary; all others are secondary. This nomenclature simplifies, but is nevertheless compatible with, the six protein inference categories previously described.(4,7) Thus, primary proteins include those that are distinct, differentiable, indistinguishable, and proteins identified by shared peptides only when inferred in the minimal list. Secondary proteins include subset, subsumable, and proteins identified by shared peptides only when not inferred in the minimal list.(4) Primary protein identifiers are integral numbers (e.g., 1,2,...) while secondary proteins have alphabetical identifiers (e.g., a,b,...), and common identifiers indicate connectivities between peptides and proteins. For example, in Figure 4, peptides_a, _b, and _c, which match protein_A (identifier 1), will contain “1” in their identifiers. Peptides_b and _c, which match both protein_A (identifier 1) and protein_B (identifier 2), contain both “1” and “2” in their identifiers. Primary proteins that are indistinguishable are marked with an asterisk, for example, peptide_x matches protein_C and protein_D, each with the identifier “3*”.
IsoformResolver lists MSD groups in descending order of peptide counts, reporting the observed mass and mass error for each MS/MS, and the number of observed charge forms and highest scores for each peptide, in accordance with reporting guidelines.(25,26) Results from multiple experiments, each containing one or more LC–MS runs, are displayed in separate columns and easily compared using a “compare profile” feature (see below). Section 2 consists of a paragraph summarizing the number of spectra, peptides, proteins, and protein groups, as well as the number of proteins supported by different numbers of peptides. Section 3 summarizes proteins inferred to be in the minimal list in the same order as Section 1 and is in a format that is useful for further automation in spectral count analyses.(10,27,28)
Protein inference can be complicated when peptides are shared between multiple protein entries. For example, proteins which are indistinguishable based on the peptide evidence (e.g., proteins_C and _D in Figure 4) complicate the protein report, because the number of proteins in the minimal list (where only one is counted) differs from the number of primary proteins (where both are counted). Reporting all indistinguishable proteins (protein_C and protein_D) inflates the protein count over the minimal list. Selecting one representative protein (protein_C or protein_D) reports the minimum count accurately, but chooses proteins arbitrarily. Treating a set of indistinguishable proteins as one entity with a concatenated name (e.g., protein_C_D) reports the correct number and retains information about the protein identities, but leads to variations in naming between data sets. Each method reports different protein lists, and each compromises accuracy, especially when comparing results from two or more protein profiles.
Also important are cases where peptides are shared between proteins that are distinguishable by the presence of other peptides. We call these cases “bridge peptides” (Our use of the term “bridge peptides” is similar but not identical to the term “razor peptides” (ref (29)). The latter refer to peptides which are, by Occam’s principle, assigned to the nonoverlapping protein group with the greatest number of peptides. By contrast, bridge peptides are assigned to protein groups which allow overlapping proteins, to retain information that the peptide is shared.), which are shared between primary proteins, and are more problematic than peptides which are shared between primary and secondary proteins. This is because when bridge peptides are encountered by protein-centric inference programs, they are either eliminated from all but one group, or else duplicated and assigned redundantly to different protein groups. An example is shown in the report of two primary proteins, where bridge peptides are replicated and comprise 70% of the peptides for each protein (Suppl. Figure S3a, Supporting Information). Because each protein is listed separately in the output, the replicated peptides may lead to overconfidence in the protein identifications.
Bridge peptides and indistinguishable proteins are a significant problem in protein profiling. For example, in Data set 1A (Suppl. Table S1, Supporting Information), 15% of the 3667 minimal list proteins were linked to others through bridge peptides, 40% were indistinguishable, and only 25% were distinct. Of the 26225 nonredundant peptides, 67% matched two or more proteins, 7% were bridge peptides, and only 33% matched a single protein entry. Thus, underlying the ambiguity in protein identifications is the fact that the shared and bridge peptides are a considerable fraction of total peptides and affect a high percentage of proteins.
These problems are addressed by IsoformResolver’s report format, which lists proteins with shared peptides together, within the context of MSD protein groups. Because primary proteins are displayed adjacently when they share peptides, the need to duplicate bridge peptides and redundantly assign them to different proteins is eliminated (e.g., Suppl. Figure S3b, Supporting Information). By displaying all possible proteins, MSD groups allow a user to immediately view the support for inferred proteins as well as alternative but equally likely candidates (Suppl. Figure S2, Suppl. Worksheet:1.xlsx, Supporting Information). The nomenclature used for the MSD identifiers allows the different classifications of distinguishable, indistinguishable, subset, and subsumed proteins to be readily assessed.
Problems also arise when protein identifications are easily altered by minor changes in observed peptides, which we refer to as “volatility”. Volatility reflects a nonrobust quality of protein inference. Suppl. Figure S4 (Supporting Information) shows an example of assigning peptides to proteins using a greedy algorithm, where two proteins are inferred as primary (IPI00181997.7 and IPI00479677.3), and five proteins are secondary (IPI00376351.2, IPI00383202.1, IPI00744506.1, IPI00785128.2, IPI00797783.1). However, in two equally plausible alternative solutions, IPI00181997.7 and IPI00376351.2 or IPI00376351.2 and IPI00479677.3 could be assigned as the primary proteins. Here, small changes in observed peptides will affect which proteins are deemed primary. For example, if peptide GSL... had not been observed, then IPI00181997.7 would have been inferred as the only primary protein accounting for all peptides, and IPI00479677.3 and IPI00376351.2 would have been called secondary. No method of protein inference obviates this problem, including those which are probability-based, or those which ignore proteins supported by a single peptide.
To quantify the effects of protein inference on volatility, we examined the repeatability of proteins identified in different data sets, collected at similar depth or varying depth of sampling. First, we quantified the degree to which proteins were repeated between three technical replicate data sets (Suppl. Table S1, Data set 2, Supporting Information), where peptides identified in any data set varied due to random sampling by LC–MS/MS. On average each data set yielded 2922 ± 83 nonredundant peptides (Table 2), 71% of which were found in at least two data sets and 48% of which were identical across all three data sets. We then examined all, primary, concatenated, and representative proteins, evaluating their overlap between replicates. As expected, the overlap between replicates was generally higher for proteins than peptides, because each protein was represented by 2.8 peptides, on average. However, we found that the degree of overlap varied with each reporting method (Table 2), due to differences in how they dealt with indistinguishable proteins.
The overlap was highest when all possible proteins were compared (82% between two or more replicates, 64% between three replicates, Table 2a), because none were removed by inference. In contrast, primary proteins, which listed indistinguishable proteins as separate entities and removed secondary proteins, showed decreased overlap between two replicates (74%) or three replicates (55%), and tended to select for splice variants and proteins that shared many peptides. Concatenated protein identifiers reduced overlap even further (70% between two replicates; 48% between three replicates). Here, indistinguishable proteins were named by concatenated identifiers, which often overlooked proteins present in common between data sets (e.g., an identifier ProteinA_ProteinB would fail to match ProteinB_ProteinC in a different data set, although ProteinB was common to both). Representative proteins increased their overlap between replicates, because proteins with the lowest accession number were chosen from among indistinguishable proteins, while information about other possible proteins was discarded.
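The ProteinA_ProteinB example can be made concrete with a short Python sketch. The overlap measure used here (the fraction of all reported identifiers that appear in at least a given number of replicates) is an assumption chosen for illustration; the exact calculation behind Table 2 is not specified in this section.

```python
def overlap_fraction(replicates, min_replicates=2):
    """Fraction of all reported identifiers found in at least `min_replicates` replicates."""
    all_ids = set().union(*replicates)
    if not all_ids:
        return 0.0
    repeated = [i for i in all_ids if sum(i in rep for rep in replicates) >= min_replicates]
    return len(repeated) / len(all_ids)


# Two replicates, each reporting one set of indistinguishable proteins.
replicate_1 = [{"ProteinA", "ProteinB"}]
replicate_2 = [{"ProteinB", "ProteinC"}]


def concatenated(sets):
    """Name each indistinguishable set by joining its sorted members."""
    return {"_".join(sorted(s)) for s in sets}


def separate(sets):
    """Report every member of each indistinguishable set individually."""
    return {protein for s in sets for protein in s}


by_name = overlap_fraction([concatenated(replicate_1), concatenated(replicate_2)])
by_member = overlap_fraction([separate(replicate_1), separate(replicate_2)])
```

Under this measure the concatenated names share nothing between the replicates, whereas listing members separately recovers ProteinB as common to both.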
Thus, methods which enumerated the most likely proteins (primary and concatenated) paradoxically led to the lowest protein repeatability. Similar trends were observed with proteins identified by two or more peptides (Table 2), indicating that the effect was not caused by peptide sampling variations or low confidence protein identifications. We hypothesized that the effects were instead due to problems introduced by protein inference.
To test this, we constructed a protein profile using a data set which pooled the three replicate data sets together (using a two peptide minimum), then annotated the results by those proteins inferred when each data set was analyzed separately. The minimal list for the pooled data set contained 760 proteins, of which 75 proteins were supported by peptides present in only one or two of the replicates (Table 2, Figure 5a). Thus 685 (90%) of all proteins were found in common between replicates, far higher than the degree of overlap observed when proteins were inferred from the three data sets independently, regardless of reporting method. Nevertheless, only 377 (55%) of the 685 proteins were inferred in all three replicates (89 distinct proteins, 288 in the same MSD groups), while 308 (45%) proteins differed between replicates. Therefore, the low repeatability across replicate sets was mainly due to variability in the proteins inferred from peptides present in all three sets. In 198 of the 308 cases, the same proteins would have been identified in each data set, but were removed because they were identified by fewer than two peptides (e.g., illustrated in Figure 5b). In the remaining 110 (16%) cases, differences between data sets were due to different protein identifications, and thus caused by parsimony.
Cases where proteins varied due to parsimony reflected volatility due to small changes in additional distinguishable peptides that were present in some, but not all replicates. For example, in Figure 5c, the presence of peptide EH... in Replicate 3 but not Replicates 1 and 2, led to inference of only one primary protein in Replicate 3, whereas three indistinguishable proteins were inferred in Replicate 1 and two indistinguishable proteins were inferred in Replicate 2. Overall, the inferred proteins showed greater differences between replicates than the peptides. These results showed that protein variations are an intrinsic feature of shotgun proteomics, not only due to variations in peptide sampling, but also because variable protein identifications are exacerbated by inference.
We next examined the replicate data sets using ISD protein groups. When each of the three replicate data sets was analyzed separately, 1109 or 626 ISD groups were respectively identified after requiring ≥1 or ≥2 peptides/protein (Table 2). The overlap in ISD groups between 2 and 3 replicate data sets was 81% and 62–65%, respectively, comparable to the overlap between all possible proteins, and significantly higher than the overlap between inferred proteins, regardless of reporting method. Thus, ISD groups allow greater overlap when comparing proteins between data sets, and therefore offer a more stable view of the protein profile.
Next we examined effects of protein inference on volatility by comparing data sets collected at different depths of sampling, comparing data sets of cell lysate proteins analyzed in duplicate 1D-LC-MS/MS runs (29,907 MS/MS) vs proteins separated by SDS-PAGE followed by in-gel digestion (252,205 MS/MS) (Suppl. Table S1, Data set 3, Supporting Information). Prior studies had shown that proteins identified in data sets at lower sampling depth overlap nearly completely with those in data sets collected at higher depth.(9) Thus as expected, the overlap was high, where 91% of peptides and 98% of proteins identified in the lower sampling depth data set were also identified in the higher depth data set (Table 3a). However, the overlap between primary proteins was only 75%.
To confirm that this variability was due to inference and not to differences between the peptides contained in each data set, we simulated a lower depth data set by truncating MS/MS spectra with lowest intensity from the higher depth Data set 3. The MS/MS removed were adjusted to yield a remaining number of peptides similar to that of the lower depth experimental data set (Table 3). Because peptides in the truncated data set were a complete subset of those in the high depth data set, any protein variations would reveal effects due only to inference. The results showed that even when the peptides in the low depth data set overlapped those in the high depth data set completely, protein inference decreased the overlap between primary proteins by 21%.
By contrast, ISD groups showed 98% overlap between data sets collected at lower and higher sampling depth and retained 100% overlap between the simulated and higher depth data sets. Thus, the mapping of proteins and peptides to invariant ISD groups added stability to the protein report, bypassing problems in reproducibility, and thereby counteracting volatility caused by protein inference.
We found that protein inference varied when data sets were joined in different ways. Often, proteomics experiments involve comparisons between LC–MS/MS runs (e.g., control vs treated, differing protocols, chromatographically separated proteins). The many data sets produced can be analyzed either by carrying out protein inference on each data set separately and then combining the results to create an aggregate set (“aggregate” analysis), or by pooling peptides from all data sets together before protein inference (“pooled” analysis) (Figure 6a).
In order to compare the two approaches, data sets were collected on cell lysate proteins that were first separated into 33 fractions by strong anion exchange (SAX) chromatography, followed by proteolysis and LC–MS/MS (Suppl. Table S1, Data set 1B, Supporting Information). In a first test, proteins were assembled from data sets of each fraction analyzed separately by IsoformResolver, which were then joined into an aggregate profile of 7699 primary (distinct + distinguishable + total indistinguishable) proteins and 4582 minimal list (distinct + distinguishable + minimal indistinguishable) proteins, where the counting excluded redundant cases. In a second test, peptides from each SAX fraction were combined into one pooled data set and then assembled into proteins using IsoformResolver, yielding 5854 primary and 3270 minimal list proteins. Thus, the aggregate profile inferred 40% more minimal list proteins than the pooled profile (4582 vs 3270). The protein overlap was nearly complete, as only one primary protein observed in the pooled analysis was excluded from the aggregate analysis. Therefore, with multiple LC–MS/MS runs, pooling the peptide information before assembly yielded a more conservative protein count.
How protein inference underlies this effect is illustrated in an example where 6 observed peptides mapped to 6 possible proteins (Suppl. Figure S5, Supporting Information). In the pooled analysis, two primary proteins (IPI00444788.1 and IPI00025340.3) accounted for all peptides (Suppl. Figure S5a, Supporting Information). However, in the aggregate analysis, the number of peptides in each fraction varied, and together inferred six primary proteins, five of which were distributed in three indistinguishable sets (Suppl. Figure S5b, Supporting Information). For example, peptides in fraction #22 identified four indistinguishable proteins (IPI00444788.1, IPI00445123.1, IPI00456744.1 and IPI00743804.1), while peptides appearing in fraction #23 identified two indistinguishable proteins (IPI00444788.1 and IPI00456744.1). Thus, even when the same peptides were represented, carrying out protein inference on separate data sets inflated the protein counts compared to pooling the data sets prior to inference. Such differences were caused by lower numbers of peptides in each fraction in the aggregate analysis, leading to increased numbers of indistinguishable proteins. In the pooled analysis, more proteins were converted to distinguishable or secondary proteins, reducing the indistinguishable proteins and minimizing the number of primary proteins.
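The inflation can be reproduced with a toy calculation. The Python sketch below is an illustration only, not the IsoformResolver implementation: the peptide-to-protein map, the protein names, and the fraction contents are all invented, and a protein is treated as primary simply when its observed peptide set is not a strict subset of another protein's set, which ignores the finer distinct, distinguishable, and indistinguishable classification.

```python
from collections import defaultdict

# Hypothetical peptide-to-protein map (all names are invented for illustration).
PEPTIDE_TO_PROTEINS = {
    "pep1": {"A", "B", "C", "D"},
    "pep2": {"A", "B"},
    "pep3": {"A"},
}


def primary_proteins(observed_peptides):
    """Return proteins whose observed peptide set is not a strict subset of another's."""
    evidence = defaultdict(set)
    for peptide in observed_peptides:
        for protein in PEPTIDE_TO_PROTEINS[peptide]:
            evidence[protein].add(peptide)
    return {
        protein
        for protein, peptides in evidence.items()
        if not any(peptides < other for other in evidence.values())
    }


# Peptides observed in each (hypothetical) fraction.
fractions = {22: {"pep1"}, 23: {"pep1", "pep2"}, 24: {"pep3"}}

# Aggregate analysis: infer per fraction, then take the union of the results.
aggregate = set().union(*(primary_proteins(p) for p in fractions.values()))

# Pooled analysis: take the union of the peptides, then infer once.
pooled = primary_proteins(set().union(*fractions.values()))
```

In this toy case the fraction containing only `pep1` cannot separate the four candidate proteins, so all four enter the aggregate set, whereas the pooled peptides leave a single protein that explains all of the evidence.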
Despite this advantage, pooling data sets discarded important information about the representation of different proteins across samples. For example, when chromatographically separating proteins, it is often useful to know how different proteins vary in elution, and here, it would be advantageous to analyze each data set separately. Therefore, IsoformResolver provides the option of displaying a “compare profile” in Section 3 (Figure 6b), in which primary proteins are inferred and spectral counts apportioned using the pooled data sets, while spectral counts are displayed per individual data set.
An example of a compare profile is shown in Suppl. Figure S5c (Supporting Information), where the pooled analysis inferred two primary proteins, and displaying each fraction separately in the output clearly showed that the two proteins resolved chromatographically. Peptides in fractions #15–17 best matched protein_764, while peptides in fractions #22–24 best matched protein_763 or secondary protein_b. In fact, in fractions #22–24, support for protein_b over protein_763 was suggested by the absence of peptide LEE... against the presence of peptides LSE..., SLS..., SPP... and KLP.... This illustrates the advantage of combining the peptide evidence with information about chromatographic resolution, allowing the user to evaluate cases that might otherwise have been overlooked. By calculating the most conservative estimate of minimal list proteins and displaying related proteins in logical groupings, IsoformResolver allows spectral count variations between individual data sets to be readily evaluated. Thus, the compare profile feature of IsoformResolver combines the strengths of pooled and aggregate analyses, by providing a conservative calculation of proteins from a pooled analysis and an informative display of results in each experiment.
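The idea of a compare profile can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the IsoformResolver output format: it assumes that a pooled inference step has already assigned each peptide to one primary protein (the hypothetical `PEPTIDE_TO_PRIMARY` map), it ignores the apportionment of bridge peptides, and the protein names, peptide names, and spectral counts are invented.

```python
from collections import defaultdict

# Assumed result of a pooled inference step: each peptide maps to one primary protein.
# Protein and peptide names are hypothetical.
PEPTIDE_TO_PRIMARY = {"pepA": "protein_X", "pepB": "protein_X", "pepC": "protein_Y"}

# Invented spectral counts per fraction: {fraction: {peptide: count}}.
counts_by_fraction = {
    15: {"pepA": 3},
    16: {"pepA": 5, "pepB": 2},
    22: {"pepC": 4},
    23: {"pepC": 6},
}


def compare_profile(counts_by_fraction, peptide_to_primary):
    """Tabulate spectral counts per primary protein and per fraction."""
    profile = defaultdict(dict)
    for fraction, peptide_counts in sorted(counts_by_fraction.items()):
        for peptide, count in peptide_counts.items():
            protein = peptide_to_primary[peptide]
            profile[protein][fraction] = profile[protein].get(fraction, 0) + count
    return dict(profile)


profile = compare_profile(counts_by_fraction, PEPTIDE_TO_PRIMARY)
```

Because the protein list comes from the pooled peptides while the counts stay separated by fraction, the table shows directly which fractions carry the evidence for each protein.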
An important approach for label-free quantification of proteins is spectral counting, which sums the total number of MS/MS spectra corresponding to any peptide in a given protein.(27) However, assigning spectral counts to proteins is complicated when bridge peptides are shared between two or more proteins in the minimal list.30,31 This can skew information on relative abundances of proteins. For example, in Figure 7a, 2 peptides (EAG..., NHP...) uniquely infer two indistinguishable proteins (GNPDA2) with 9 spectral counts, and 5 peptides (AAG..., DHP..., FFD..., LII..., and LVD...) uniquely infer one protein (GNPDA1) with 37 spectral counts. Four bridge peptides (AIE..., EVM..., TFN..., VPT...) represent an additional 45 spectral counts, and how these are apportioned can greatly influence the estimated relative abundance of GNPDA1 and GNPDA2. IsoformResolver apportions spectral counts from bridge peptides proportionally to the spectral counts of nonshared peptides for distinguishable proteins. In this example, 20% of spectral counts from bridge peptides were apportioned to GNPDA2 (9 of 46 nonshared counts) and 80% were apportioned to GNPDA1 (37 of 46) (Figure 7b). Similar calculations are used to apportion nonredundant peptides. Apportioned spectral counts for bridge and nonredundant peptides are then summarized in Section 3 of the IsoformResolver output (Figure 4, Suppl. Figure S2, Supporting Information). We report spectral counts for distinguishable and bridge peptides separately, as the primary evidence for each protein. Apportionment of spectral counts according to the number of distinguishable peptides is also included, which can be useful for comparing proteins containing bridge peptides with those that do not.30,31
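The proportional rule can be written out as a short calculation. The Python sketch below is a minimal illustration of apportioning bridge-peptide counts in proportion to nonshared counts, using the GNPDA1 and GNPDA2 numbers quoted above; the function name `apportion_bridge_counts` is hypothetical and the code is not taken from IsoformResolver.

```python
def apportion_bridge_counts(unique_counts, bridge_count):
    """Split bridge-peptide spectral counts in proportion to nonshared counts.

    unique_counts: {protein: spectral counts from nonshared peptides}
    bridge_count: total spectral counts from peptides shared by these proteins
    """
    total_unique = sum(unique_counts.values())
    if total_unique == 0:
        raise ValueError("no nonshared spectral counts to apportion against")
    return {
        protein: bridge_count * count / total_unique
        for protein, count in unique_counts.items()
    }


# Nonshared spectral counts from the example in the text.
unique_counts = {"GNPDA1": 37, "GNPDA2": 9}

# The four bridge peptides contribute 45 spectral counts in total.
bridge_share = apportion_bridge_counts(unique_counts, 45)

# Apportioned total per protein: nonshared counts plus its share of bridge counts.
totals = {p: unique_counts[p] + bridge_share[p] for p in unique_counts}
```

The shares always sum to the original bridge count, so no spectral counts are duplicated or lost, in contrast to assigning the bridge peptides redundantly to both proteins.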
Figure 7c shows examples which break down spectral counts according to SAX fractions, and illustrate how spectral counts for nonbridge vs bridge peptides can provide information about the reliability of protein identifications and the presence of related proteins. Case [i] shows a simple example, where bridge peptides track two proteins (1009*, 1010) in each of fractions #19–22, and support the presence of each protein. Case [ii] shows bridge peptides which match two primary proteins (1065, 1066*) but track only one protein (1065). In Case [iii], some bridge peptides appear in fractions #33–39 but track neither primary protein (363or364), suggesting that they instead correspond to another protein. Because IsoformResolver reports detailed information about all proteins and their spectral counts, such cases can be readily assessed and overlooked proteins identified.
A unique feature of IsoformResolver is that it clusters the display of proteins based on shared peptides, allowing proteins related by bridge peptides and belonging to the same MSD and ISD groups to be listed adjacently. This solves problems caused by listing proteins separately, which may lead to overconfidence in protein identifications. For example, Figure 7c, Case [iv] shows two paralogous proteins from different genes which differ widely in spectral counts (81 for 121*, 1 for 122*). Redundantly assigning the 18 bridge peptides to both proteins might create false confidence for the presence of protein 122*, especially if the proteins were reported in different regions of the output. By displaying these proteins adjacently in the output, potential false positive peptide assignments (e.g., with disproportionately few spectral counts) and the apportionment of bridge peptides are readily evaluated. In addition, clustering proteins by ISD groups allows related proteins to be easily identified. For example, in Case [v], proteins 1470* and 1290 are paralogs that share amino acid sequences, but no bridge peptides were observed and the peptides for proteins 1470* and 1290 were nonoverlapping. Here, protein-centric methods would have placed each protein in separate groups, and the fact that these genes are related would have been missed. The ability of IsoformResolver to display ISD groups adjacently allowed these related gene products to be listed together, facilitating evaluation of their relative abundance by spectral counting.
We evaluated whether ISD groups might contain proteins that share biological function as well as peptide sequence. Functional relatedness was evaluated in multiprotein ISD groups (i.e., with two or more proteins), scoring agreement between IPI UniProt (DAT) and GO database annotations, and requiring at least one annotation to be shared in common among all proteins within an ISD group. We assessed first whether proteins within each group were derived from common genes; second, whether they were members of a common protein family, although not derived from a common gene; and third, whether they were functionally related by GO or other annotations, although not members of a common protein family.
Of the 10651 multiprotein ISD groups generated from shared peptides of 8 amino acids or longer, 7136 (67%) contained protein members all derived from a common gene (e.g., splice variants, processed protein forms), 1683 more (16%) contained members all belonging to a common protein family, and 538 more (5%) contained protein members sharing GO or other cross-reference annotations (Figure 8a). Another 929 (9%) contained members with incomplete annotations; however, the proteins that were annotated showed complete agreement in gene, protein family, or other annotations. Thus in 97% of ISD groups, all protein members that could be evaluated were functionally related. In the remaining 3%, proteins often appeared related. For example, one group contained proteins with similar gene names (CNNM1, CNNM2, CNNM3, and CNNM4, corresponding to cyclins M1–M4), even though their annotations were nonoverlapping. Because gene names do not always report function, this group was not scored, although its members were clearly related.
We also examined the frequency with which proteins between different ISD groups were unrelated. Here, we scored “exclusivity”, when a group was the only one which corresponded to a particular gene, protein family, or other cross-reference identifier. Among the 7136 ISD groups whose protein members unanimously specified a single gene annotation, 7041 (99%) were exclusive. Not surprisingly, protein family annotations did not show the same degree of exclusivity. Among 5868 groups whose proteins unanimously specified a common protein family annotation (4185 also specifying a common gene), only 1410 cases were exclusive. The results show that proteins that share even few peptides in common are related functionally, and that for the most part, ISD groupings capture all proteins which are related, while excluding proteins which are unrelated.
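The two scores can be made concrete with a small example. The Python sketch below is one possible reading of the consensus and exclusivity criteria described above, not the scoring code used in the study: the groups, protein names, and gene names are invented, and exclusivity is assumed to mean that no protein in any other group carries the annotation on which the group agrees.

```python
# Hypothetical ISD groups: {group_id: {protein: set of gene annotations}}.
isd_groups = {
    1: {"P1": {"GENE_A"}, "P2": {"GENE_A"}},
    2: {"P3": {"GENE_B"}, "P4": {"GENE_B"}},
    3: {"P5": {"GENE_B"}, "P6": {"GENE_B"}},
    4: {"P7": {"GENE_C"}, "P8": {"GENE_D"}},
}


def consensus_annotations(group):
    """Annotations shared by every protein in the group (empty set if none)."""
    return set.intersection(*group.values())


def is_exclusive(group_id, groups):
    """True if the group's consensus annotations appear in no other group."""
    shared = consensus_annotations(groups[group_id])
    if not shared:
        return False  # exclusivity is only scored for groups with a consensus
    for other_id, other in groups.items():
        if other_id != group_id and shared & set().union(*other.values()):
            return False
    return True


consensus = {gid: consensus_annotations(g) for gid, g in isd_groups.items()}
exclusive = {gid: is_exclusive(gid, isd_groups) for gid in isd_groups}
```

Here group 1 agrees on a gene that no other group uses, groups 2 and 3 each agree on a gene but share it with each other, and group 4 has no annotation common to all of its members.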
This behavior changed with the length of shared peptides. Protein groups constructed from shared peptides with minimum length 5, 6, or 7 amino acids produced fewer protein groups, each with higher average numbers of proteins (Figure 8b). On the other hand, as peptide length and the number of groups increased, the relatedness of proteins within each group also increased (Figure 8c). Considering only gene annotations, increased peptide length led to increased consensus, while exclusivity remained constant (data not shown). A minimum length of 5 amino acids yielded large ISD groups, averaging 98 protein entries, whose proteins exhibited functional relatedness within 84% of groups. A minimum length of 12 amino acids yielded more ISD groups, with little change in functional relatedness compared to 8 amino acids. Overall, 8 amino acids was the optimal minimum length for grouping proteins with common function. This was the minimum length previously determined for filtering out false positives during peptide identification.9,14 We conclude that 8 amino acids provide an optimal minimum peptide length for protein grouping as well as peptide identification.
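The dependence of group size on minimum peptide length can be illustrated with a toy grouping procedure. The Python sketch below is an assumption-laden illustration rather than the IsoformResolver algorithm: it assumes a fully tryptic in silico digest (cleavage after K or R, not before P) with no missed cleavages, uses invented sequences, and links proteins that share any peptide of at least the minimum length with a simple union-find structure.

```python
import re


def tryptic_peptides(sequence, min_length):
    """Fully tryptic peptides (no missed cleavages) of at least min_length residues."""
    pieces = re.split(r"(?<=[KR])(?!P)", sequence)
    return {p for p in pieces if len(p) >= min_length}


def isd_groups(proteins, min_length):
    """Cluster proteins that share at least one in silico peptide >= min_length."""
    parent = {name: name for name in proteins}

    def find(name):
        # Follow parent links to the group root, shortening the path on the way.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    first_owner = {}  # peptide -> first protein seen containing it
    for name, sequence in proteins.items():
        for peptide in tryptic_peptides(sequence, min_length):
            if peptide in first_owner:
                parent[find(name)] = find(first_owner[peptide])
            else:
                first_owner[peptide] = name

    groups = {}
    for name in proteins:
        groups.setdefault(find(name), set()).add(name)
    return {frozenset(members) for members in groups.values()}


# Invented sequences: A and B share a 9-residue peptide, B and C a 6-residue peptide.
proteins = {
    "A": "AAAAAKGGGGGGGGK",
    "B": "GGGGGGGGKCCCCCR",
    "C": "CCCCCRDDDDDDDDDK",
}

groups_min5 = isd_groups(proteins, 5)
groups_min8 = isd_groups(proteins, 8)
```

With a minimum length of 5 the short shared peptide chains all three toy proteins into one group, while a minimum length of 8 splits off protein C, mirroring the trend toward more and smaller groups as the minimum length increases.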
We compared IsoformResolver to other programs used for protein inference (IDPicker, Panoramics, Phenyx, Scaffold, TPP ProteinProphet). The programs varied with respect to input/output format, ease of use, and other features (summarized in Suppl. Table S2, Supporting Information). Here, we focused on their differences with respect to protein inference, protein grouping, how they dealt with indistinguishable proteins, their ability to handle large data sets, and comparison of results between different data sets.
We first compared software with respect to protein inference on a single LC–MS/MS run (Suppl. Table S1, Data set 1D, Supporting Information). The numbers of peptides and proteins reported by each program were comparable and default parameters were used in each case, with settings chosen to yield comparable numbers of identified peptides. One complication was that Phenyx, Scaffold, and ProteinProphet integrate peptide identification algorithms into the software, each using different underlying methods to choose peptides, assess false assignments, and evaluate low scoring MS/MS spectra. This introduced variations in identified peptides, which complicated the comparison of protein identifications. Therefore, IsoformResolver was used to specify ISD groups from the peptides identified by each program. In this way, we could assess proteins identified by each program that were within the same ISD group, allowing differences in protein inference rather than differences in peptides to be evaluated.
Each program yielded proteins corresponding to 255–295 ISD groups. Of the 238 groups common to all six programs, 60 contained proteins that were distinct and unambiguously identified by all. In order to minimize differences due to peptide variations, 112 of the remaining 178 ISD groups were selected because all programs mapped identical proteins to the peptides in these groups (termed “meta-peptides” by ref (22)). We inspected and compared proteins inferred for these 112 ISD groups.
Certain programs showed greater similarities in their protein identifications. Programs that reported all indistinguishable proteins as primary (TPP ProteinProphet, Panoramics, Scaffold, and IsoformResolver in its default mode) showed greater similarities in protein identification with each other, compared to programs which selected a single, representative protein from among each indistinguishable set (IsoformResolver in its representative protein selecting mode, Phenyx, and IDPicker). We identified five different cases. In 20 of 112 ISD groups (Case 1), the same proteins were identified by all 6 programs (Figure 9; see Suppl. Table S3 and Suppl. Worksheet:3.xlsx for the entire analysis, Supporting Information). In 34 ISD groups (Case 2), identical proteins were inferred by programs which selected and displayed all primary proteins, and by programs which displayed only representative proteins, although the proteins differed between the two program types. In 53 ISD groups (Case 3), proteins were identical among programs that displayed primary proteins, but nonidentical among programs that selected representative proteins. In 3 ISD groups (Case 4), the proteins were nonidentical among programs displaying primary proteins but identical among those selecting representative proteins. The remaining 2 ISD groups (Case 5) showed no agreement in proteins identified between the two kinds of programs. Thus, agreement was generally found between programs that selected primary proteins, while programs that selected representative proteins often disagreed with each other, and sometimes chose proteins that none of the other programs inferred. Similarly, analysis of the Sigma-Aldrich UPS1 sample of purified human proteins, where true and false protein identifications could be determined, showed that programs reporting primary proteins yielded more true assignments than programs reporting representative proteins (data not shown).
We conclude that reporting primary proteins yields greater agreement after protein inference, whereas reporting representative proteins, while convenient for simplifying output, loses important information.
An important difference between these programs was how they displayed bridge peptides. Phenyx, Panoramics, Scaffold, and ProteinProphet replicated bridge peptides, listing them redundantly with proteins that shared them. IDPicker dealt with bridge peptides by assigning them to only one protein and discarding them from others. When ProteinProphet and IsoformResolver profiles were compared (Suppl. Table S1, Data set 1C, Supporting Information), 777 MSD groups were found in common by both programs. ProteinProphet displayed secondary (subset) proteins within its protein groups (e.g., as in Suppl. Figure S3a, Supporting Information), but separated protein groups that shared bridge peptides. By contrast, IsoformResolver listed each peptide once within its MSD group, so that bridge peptides were neither overrepresented nor underrepresented (Suppl. Figure S3b, Supporting Information), and reported the MSD groups adjacently in the output. Because ProteinProphet separated protein groups that shared bridge peptides, 17 of the 777 MSD groups in IsoformResolver were displayed as 34 protein groups in ProteinProphet, where members of each pair of related protein groups were separated far from each other in the output. This illustrates the advantage of a display that positions related proteins adjacently, in a manner that avoids peptide replication and redundancy.
Finally we examined the ability of each program to compare results from two or more data sets. IsoformResolver, Scaffold, Phenyx, and IDPicker were each able to display differences between multiple data sets within a single protein inference profile. Scaffold and Phenyx only allowed comparison of individual LC–MS/MS runs, while IsoformResolver and IDPicker allowed for any number of LC–MS/MS data sets (Suppl. Table S2, Supporting Information).
We also examined the ability of each program to compare results from separate protein inference analyses, for example, data sets analyzed at different times and then compared retrospectively. All programs allowed primary proteins to be manually compared between separate analyses; however, differences in protein inference and shortcomings of protein reporting led to overestimates of variation between analyses. This was alleviated by reporting protein groups, as allowed by IsoformResolver and IDPicker. However, IDPicker identified protein groups sequentially per profile, preventing their comparison against protein groups from other protein profiles. Only IsoformResolver had a stable (ISD) numbering scheme that allowed uniform comparisons between different experiments.
In this study, we describe IsoformResolver in detail for the first time. We demonstrate that protein inference exacerbates volatility in protein identifications, such that small changes in peptides lead to greater changes in the inferred proteins. We show that protein inference causes significant protein variation introduced by LC–MS/MS sampling in technical replicates, and even when peptides are completely overlapping between full data sets and simulated subsets. When many data sets are compared, protein repeatability is improved by pooling data sets at the peptide level and performing inference once, instead of performing inference on each experiment and aggregating the results. However, the pooled analysis loses important information gained by analyzing each experiment individually.
Underlying the problem of protein volatility is the question of how to select between indistinguishable proteins, inferred as present but not distinguishable from other equally possible candidates. Indistinguishable proteins must be counted singly and yet must be linked to multiple protein identifiers, because reporting all proteins in an indistinguishable set overestimates their presence, but reporting only one of several proteins loses valuable information. No single method of reporting protein identifiers—listing all proteins, primary proteins, concatenated identifiers, or representative proteins—completely solves the problem of underrepresenting or overrepresenting proteins in the sample due to protein inference.
Another important question is how to treat peptides that bridge multiple primary proteins. The results can be misleading when protein inference programs either assign the bridge peptides to only one protein arbitrarily, or else replicate the peptides and match them redundantly to multiple proteins, which may underestimate or overestimate the peptide evidence for a protein. We find that in complex protein databases like the human proteome, the number of bridge peptides increases as more peptides are identified with higher depth (e.g., see Data set 1 in Suppl. Table S1, Supporting Information).
IsoformResolver addresses all of these problems by reporting proteins and peptides in the context of MSD and ISD groups, developed using a peptide-centric strategy which lists each peptide once and matches each observed peptide to all proteins that share its sequence. In this way, primary, secondary, and indistinguishable proteins can be immediately assessed by the presence or absence of distinguishing peptides, and are clearly marked in the output. Displaying proteins in the context of MSD groups avoids the problems of listing peptides redundantly or arbitrarily assigning them to one primary protein. Displaying primary, indistinguishable, and secondary proteins adjacently avoids loss of information about their relatedness, and allows the experimentalist, not the software, to decide which proteins are most likely present.
By displaying MSD groups adjacently and linked by ISD groups, all proteins linked by shared peptides can be listed together, even when the peptides are not observed experimentally. We show that proteins within ISD groups are usually derived from the same gene or products of gene duplication, exhibiting functional relationships which reflect their underlying sequence identity. Importantly, experimentally observed peptides and proteins can be mapped to protein identifiers which are invariant for a given database, lending stability to the protein profiles by allowing comparisons to be made between experiments analyzed at different times and using different software. ISD groups also allow IsoformResolver to facilitate comparison between data sets by spectral counting, by allowing related proteins to be listed adjacently.
In summary, protein inference remains a challenging problem, but the approach used by IsoformResolver, of converting a protein database into a peptide-centric format in which all nonredundant peptides are premapped to proteins, and all proteins are mapped to ISD groups, helps counteract many ambiguities introduced by the inference problem. In addition, when large data sets are involved, or many data sets must be compared, the algorithms employed by IsoformResolver allow greatly increased speed in execution time compared to other software. Presenting protein and peptide results in the context of MSD and ISD groups is a logical, complete, and concise way to display proteomics information, which solves problems in comparing data sets of high complexity from shotgun proteomics.
Software and peptide-centric database files are available upon request.
We are indebted to Chia-Yu Yen, John Prince, Brian Searle, David Tabb, and Richard Johnson for valuable discussions. Funding was provided by an NRSA Molecular Biophysics Predoctoral Training Grant GM065103 (K.M.-A.), NIH R01 CA118972 (N.G.A.), and NIH R01 CA126240 (K.A.R.).
§ Brian Eichelberger, Dept. of Chemistry, John Brown University, Siloam Springs, AR 72761
Deceased January 8, 2009
Supplemental figures and table. This material is available free of charge via the Internet at http://pubs.acs.org.