A new method for enhancing peptide ion identification in proteomics analyses using ion mobility data is presented. Ideally, direct comparisons of experimental drift times (tD) with a standard mobility database could be used to rank candidate peptide sequence assignments. Such a database would represent only a fraction of sequences in protein databases and significant difficulties associated with the verification of data for constituent peptide ions would exist. A method that employs intrinsic amino acid size parameters to obtain ion mobility predictions that can be used to rank candidate peptide ion assignments is proposed. Intrinsic amino acid size parameters have been determined for doubly-charged peptide ions from an annotated yeast proteome. Predictions of ion mobilities using the intrinsic size parameters are more accurate than those obtained from a polynomial fit to tD versus molecular weight data. More than a two-fold improvement in prediction accuracy has been observed for a group of arginine-terminated peptide ions twelve residues in length. The use of this predictive enhancement as a means to aid peptide ion identification is discussed and a simple peptide ion scoring scheme is presented.
Since the inception of methods to identify peptide ions by tandem mass spectrometry (MS/MS) techniques,1–3 there has been a rapid advance in mass spectrometric instrumentation development. These advances are in large part spurred by the need to increase the overall numbers of identified peptides and proteins in proteomics experiments in order to provide the necessary increased protein complement coverage for accurate and relevant comparative analyses. Over the last 15 years, improvements in mass spectrometry (MS) instrumentation have resulted in increased numbers of assigned peptide ions obtained from liquid chromatography (LC)-MS/MS experiments for complex proteomics samples;4–6 in the characterization of human plasma digests,7–12 numbers of assigned peptide ions in a given experiment have increased by nearly 2 orders of magnitude over this time period.
Although improvements in instrumentation sensitivity and speed have enabled increased numbers of peptides to be identified, a problem of false identification has persisted. The problem is so pervasive in the field that there has been a push to standardize proteomics reporting consisting of the establishment of guidelines for disclosure of statistical analyses used to establish the accuracy of assignments.13 In part, instrumentation improvements lead to the intransigence of the false identification problem as lower-signal species move into identification range with increased analytical performance capabilities. Typically such species produce lower-quality spectra leading to suspect assignments. There is a constant need to develop methods to improve the confidence of peptide ion assignments.
To dramatically improve the accuracy of assignments in proteomics studies, the measurement of new characteristics attributable to dataset features is required. As an example consider the enabling effect of MS/MS experiments. Whereas, the precursor ion mass is insufficient to allow identification of peptide ions in complex proteomics samples, the addition of MS/MS information allows accurate assignments in many cases. A question that arises is how will the new distinguishing characteristics be produced? Some advocate chemometric approaches to elucidate distinguishing characteristics buried in proteomics datasets. For example, ongoing work consists of efforts to predict ion fragmentation distributions (including ion intensities)14–25 as well as LC retention26–31 in order to provide increased identification accuracy. Finally, improved separations of dataset components can be used to enhance peptide ion assignments. Examples include the use of increased mass accuracy permitting more stringent mass matching thresholds for protein database searches32–34 as well as precursor and fragment ion intensity matching that includes the use of LC retention time profiles35,36.
The work presented here describes the use of an additional precursor ion trait, ion mobility, to evaluate peptide ion assignments. Specifically, the use of mobility data obtained from LC-MS/MS analyses of the yeast proteome is evaluated as a means for improving peptide ion identification. Briefly, similar to ion mobility spectrometry (IMS) experiments performed previously,37–40 peptide ion composition is related to measured ion mobilities in order to determine the general effect that the presence of specific amino acid residues has on the overall mobilities of database ions. Upon establishing this relationship for groups of peptide ions, the ability to match drift times (tD) with peptide ions based solely on amino acid composition has been evaluated. A simple peptide ion identification scoring scheme for data that can be produced on current commercial instrumentation (Synapt HDMS, Waters) is discussed. Finally, it is noted that this work is related to efforts to predict tDs of peptide ions using artificial neural networks (ANNs).41
Data from the analysis of a yeast proteome was provided by Waters Corporation. IMS techniques,42–46 instrumentation,47–54 and theory,55–59 as well as the combination of LC with IMS-MS instrumentation60–65 have been discussed elsewhere. Here only a brief description of methods related to the collection of the tryptic digest data is presented.
800 ng of a tryptic digest of S. cerevisiae was injected onto a Trapping and Nanoscale column configuration using a nanoACQUITY (Waters) UPLC system. Peptides were separated on the UPLC prior to being electrosprayed into the entrance orifice of the Synapt HDMS (Waters) instrument. Peptide ions were stored in the Trap Travelling Wave (T-Wave) located at the front of the IMS (Ion Mobility Separation) T-Wave device. Periodically, ion packets from the Trap T-Wave were pulsed into the IMS T-Wave cell where ions were separated according to their mobilities through a buffer gas (N2 for these experiments) under the influence of a drift voltage that is rapidly transmitted along adjacent electrostatic lenses in the IMS T-Wave cell. The repetition of this voltage transmission (wave) provides periodic separation of ions according to their mobilities. Most ions have mobilities that are lower than the transmission rate of the T-Wave voltage, causing them to “roll” back and be separated in subsequent waves. The residence times in the T-Wave cell can be calibrated to ion mobilities and thus to collision cross sections. After exiting the IMS T-Wave cell, ions were transmitted through a Transfer T-Wave collision cell into a time-of-flight (TOF) MS device for mass analysis. The collision energy of the Transfer T-Wave was increased on an alternate scan basis, producing approximately 10 low energy and 10 high energy spectra across each chromatographic peak.
Yeast strain W303 (MATa ura3-52 leu2-3 leu2-112 trp1-1 ade2-1 his3-11 can1-100) was grown at 30 °C to exponential phase (A600 = 0.8) in rich YEPD medium (2% w/v glucose, 2% w/v bactopeptone, 1% w/v yeast extract). Cells were harvested by centrifugation and washed with water to remove any traces of growth medium. Cells were resuspended in ice-cold water and broken with glass beads using a Minibead beater (Biospec Products, Bartlesville, OK) for 40 s at 4 °C. Cell debris was pelleted in a microcentrifuge for 15 min (13,000 rpm; 4 °C) and supernatants collected for further analysis.
400 μg of protein was suspended in 44 μL of 50 mM ammonium bicarbonate solution containing 0.1% Rapigest (Waters Corporation) and heated at 80 °C for 15 minutes. Disulfide bonds were reduced by addition of DTT (5 mM) and incubation at 60 °C for 30 minutes. Protein samples were then alkylated by addition of iodoacetamide (10 mM) and incubation at 23 °C for 1 hour in the dark. Trypsin (1:50 trypsin:protein) was added to the protein solution and the sample was incubated for 16 hours at 37 °C. Rapigest was then removed by adding TFA to a final concentration of 0.5%, incubating at 37 °C for 45 minutes, and spinning down at 13,000 rpm for 20 minutes.
800 ng of the tryptic sample was loaded onto a 180 μm × 20 mm Trapping column and washed with 30 column volumes of solvent A (99.9% H2O, 0.1% formic acid). Peptides were separated on this column and a 75 μm × 200 mm column using 1.3, 0.7, and 0.44% per minute gradient increases in solvent B (99.9% ACN, 0.1% formic acid) starting from an initial mixture of 99:1 solvent A:solvent B. Total separation times of 60, 90, and 120 minutes were used for the LC separation, resulting in total experimental times of 180, 270, and 360 minutes for the replicate runs. A flow rate of 300 nL·min−1 was used to perform the LC separation and the eluent was directed into a capillary ESI tip for direct electrospray into the mass spectrometer.
To perform the mobility separation, the IMS T-Wave height was set to 40 V during transmission. The wave velocity was set at 600 m/s. These settings resulted in a total separation time of 13.7 ms. Nitrogen gas pressure in the IMS T-Wave was maintained at 3.27 mBar. The TOF mass spectrometer was operated in “V” mode with a resolving power of >2×10^4 FWHM and a mass accuracy of 3 ppm RMS. MS/MS experiments were performed using the IdentityE mode.66 Here conditions in the Transfer T-Wave located behind the IMS T-Wave cell are alternated between those that favor transmission of precursor ions (Collision Energy 0 V) and those that induce precursor ion dissociation (Collision Energy ramped from 19 to 45 V). Product ions produced under these conditions have the same chromatographic retention time and the same ion mobility as their precursor. Precursor and product ion mass spectra were acquired over the mass range 50 to 2000 amu with an acquisition rate of 0.9 s per spectrum. A total of 10,000 MS/MS spectra were generated and subjected to protein database searches using the Waters ProteinLynx Global Server (PLGS) and IdentityE software suite.
Ion mobility enhanced MSE spectra were submitted to the PLGS software suite for protein database searches. Mass tolerances used for database searches were 5 ppm and 10 ppm for precursor and product ions, respectively. At least two unique peptides of greater than a 95% probability were required for a protein to be reported. A forward/reverse protein database search strategy was implemented to limit the number of proteins reported. For the datasets used in this study, the protein false discovery rate was set to 1%.
To provide the best estimation of intrinsic amino acid size parameters it was necessary to filter the datasets to group ions into those that may share structural similarities. For the work performed here, the first filter requirement was that peptide ions be doubly charged. The second filter criterion removed all peptides with missed cleavages so that only peptide ions for which the location of the protons is known were used. Next, peptide ions were divided into those containing a C-terminal arginine residue and those containing a C-terminal lysine residue. Finally, within these two subgroups, the peptides were further divided by length (number of amino acids). Size parameterization was performed as described below for each of these groups of peptide ions. Matrix manipulation was achieved using the MATLAB software suite.67
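The filtering and grouping steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used to produce the reported parameters: the function names are hypothetical, a missed cleavage is approximated as any internal K or R (the proline exception of trypsin is ignored), and the last two entries of the example list are made-up inputs included to exercise the filters.

```python
from collections import defaultdict

def has_missed_cleavage(sequence):
    """Simplified rule (assumption): any K or R before the last residue."""
    return any(residue in "KR" for residue in sequence[:-1])

def group_peptides(peptides):
    """Group (sequence, charge) pairs by (C-terminal residue, length).

    Only doubly charged peptides that end in K or R and contain no
    internal K or R are kept, mirroring the filters described in the text.
    """
    groups = defaultdict(list)
    for sequence, charge in peptides:
        if charge != 2:
            continue  # filter 1: doubly charged ions only
        if not sequence or sequence[-1] not in "KR":
            continue  # must be lysine- or arginine-terminated
        if has_missed_cleavage(sequence):
            continue  # filter 2: no missed cleavages
        groups[(sequence[-1], len(sequence))].append(sequence)
    return dict(groups)

peptides = [
    ("NTTIPTK", 2),
    ("QAYAVSEK", 2),
    ("EAYVPATK", 2),
    ("IIENAEGSR", 2),
    ("QAYAVSEK", 3),    # hypothetical entry: wrong charge state
    ("AKQAYAVSEK", 2),  # hypothetical entry: internal K (missed cleavage)
]
groups = group_peptides(peptides)
for key in sorted(groups):
    print(key, groups[key])
```

Each resulting group would then be parameterized separately. Merging sparsely populated lengths into a single group (as is done later for the 14- and 15-residue arginine-terminated peptides) is not handled by this sketch.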
To determine the contribution of each amino acid residue to the overall size of the peptide ions, those sequences estimated to exhibit similar gas-phase structures are selected (see selection criteria above and discussion below). As described previously,37–40 from ion mobility measurements for the peptide ions within a parameterization set it is possible to establish a system of equations relating size (ion mobility) to the amino acid composition using equation 1,
$$\sum_{j=1}^{n} X_{ij}\,p_{j} = y_{i} \tag{1}$$
In equation 1, i and j represent a given peptide ion in the parameterization set (i = 1 to m, where m is the total number of peptides in the set) and the given amino acid residue (j = 1 to n where n is the number of separate amino acids), respectively. X represents the frequency of occurrence of the jth amino acid in the ith peptide of the parameterization set. The variable y is related to the ion mobility (represented here by a calibrated tD) of the ith peptide ion. For these experiments y is calibrated to obtain a reduced tD. Because peptide ion size is correlated to mass, it is necessary to calibrate the system such that differences in y within a subset of peptide ions are associated with peptide composition and sequence rather than differences in mass alone. That is, dividing the tD of a peptide ion by that of a “model” peptide ion of the same mass (obtained from a second-order polynomial fit to the tD versus molecular weight data) captures the variability in y at given masses. This variability is presumably determined largely by differences in peptide amino acid composition and sequence. Finally, because the ratio of tD values is the same as the ratio that would be obtained for collision cross sections, values of p are referred to as intrinsic “size” parameters. In equation 1, p represents the intrinsic size parameter of the jth amino acid.
The m/n = 1 diagonal of the variance-covariance matrix of the size parameters (Mp) provides the variance for the size parameter pn where69
In equation 5, the residual term corresponds to the residuals (y − Xp) of the individual equations.70 Errors representing one standard deviation can be obtained as the square root of the variance for each intrinsic size parameter. For the study presented here, the size parameters have been determined for groups of peptide ions having the same length within the lysine- and arginine-terminated subgroups (see above). The size parameters for the C-terminal residues have been maintained at the previously reported values of 1.230 and 1.150 for lysine and arginine, respectively.37 This has been done to prevent the fit from treating these parameters as “compensating” terms, because each of these residues occurs exactly once in every peptide ion sequence.
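For orientation, the least-squares relations that this passage appears to rely on can be written in standard matrix notation. What follows is the textbook formulation for an overdetermined linear system, not necessarily the exact expressions or equation numbering of the original; the symbols ε (residual vector) and s² (residual variance) are assumptions introduced here.

```latex
\begin{aligned}
\mathbf{X}\,\mathbf{p} &= \mathbf{y} \\
\mathbf{p} &= \left(\mathbf{X}^{\mathsf{T}}\mathbf{X}\right)^{-1}\mathbf{X}^{\mathsf{T}}\mathbf{y} \\
\boldsymbol{\varepsilon} &= \mathbf{y} - \mathbf{X}\,\mathbf{p} \\
\mathbf{M}_{p} &= s^{2}\left(\mathbf{X}^{\mathsf{T}}\mathbf{X}\right)^{-1},
\qquad s^{2} = \frac{\boldsymbol{\varepsilon}^{\mathsf{T}}\boldsymbol{\varepsilon}}{m-n}
\end{aligned}
```

Here X is the m × n matrix of residue frequencies, y the vector of reduced tD values, and the diagonal elements of M_p give the variances of the individual size parameters. When a C-terminal parameter is held fixed, its contribution would presumably be moved to the right-hand side before solving, reducing the number of fitted parameters by one.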
Figure 1A shows the values of the intrinsic amino acid size parameters obtained for doubly-charged, arginine-terminated peptide ions containing 12 amino acid residues. Several trends are worth noting. First, nonpolar aliphatic residues generally have larger intrinsic size parameters (i.e., they have a greater contribution to peptide ion size) than polar aliphatic residues. This is very similar to the trend observed for singly-charged, lysine-terminated peptides and it has been suggested that stronger interactions between the charge site and polar residues may account for the difference in size.37 Another similarity is that the size parameters for the aromatic residues are intermediate in value to those of the nonpolar aliphatic and the polar aliphatic residues. Additionally, the size parameters for proline and glycine are relatively small. When compared to the previous work,37 the size parameter for valine obtained from this peptide ion group is relatively large. The intrinsic size parameters for histidine and cysteine are the smallest determined for this parameterization set. Finally, it should be noted that the size parameter errors for cysteine, histidine, methionine, and tryptophan are larger than those of other residues. This can be attributed to the relatively low level of occurrence of these amino acids in the peptide ion group used to obtain parameters. For example, the numbers of occurrences of these residues among the 102 peptides in this group are 7, 3, 12, and 14, respectively. In comparison, alanine occurs 118 times within the same parameterization set.
Previously it has been demonstrated that intrinsic size parameters can be used to predict peptide ion collision cross sections.37–40 The study showed that predictions improved upon restricting the sizes and types of peptide ions used to obtain the parameters. The reasoning for the improvement is that ions exhibiting similarities in length, composition (i.e., no missed cleavages), charge, and C-terminal residue (R or K) are more likely to adopt related gas-phase conformations; these similarities would be reflected in the intrinsic amino acid size parameters and thus lead to greater prediction accuracy for peptides within a subset. Indeed, in a previous study collision cross section prediction accuracy decreased by as much as a factor of two when size parameters from one parameterization set were used in cross section calculations for another set.39 For the present study, seventeen peptide subgroups have been extracted from the annotated proteome dataset. Figure 1B shows the average size parameters obtained from each of the parameterization sets (peptides of different length) for arginine- and lysine-terminated peptides. For the former, average values were obtained from intrinsic size parameters determined for peptides having residue lengths of 7, 8, 9, 10, 11, 12, 13, and 14 to 15. The last grouping is required because of an insufficient number of peptide ions containing either 14 or 15 amino acid residues. For lysine-terminated peptides, size parameters from peptide groups with lengths of 7, 8, 9, 10, 11, 12, 13, 14, and 15 residues were obtained. Figure 1B shows that similar trends in size parameters are obtained for the different peptide ion subgroups.
Size parameters can be used with amino acid composition to predict reduced tDs using equation 1. Because peptide ion tD values are calculated for the ions used to obtain parameters, the calculations can be termed retrodictions. Previously we have shown that retrodictions are very similar in accuracy to bona fide predictions and therefore we shall use the term predictions throughout this work.39 The predicted tDs can be compared with experimental values to assess the quality of the intrinsic size parameter determination for each dataset. As an example consider the peptide ion [NTTIPTK+2H]2+ from the heat shock protein SSC1. This seven-residue peptide ion has a tD peak centered at 36.31 bins. From a polynomial fit to the tD versus molecular weight data, it is observed that a “model” peptide of the same m/z (774.4 Da) would have a peak centered at a tD of 36.65 bins. Thus the reduced tD for [NTTIPTK+2H]2+ would be 0.991 (36.31/36.65). The predicted reduced tD would be calculated according to equation 1 as XNpN + XTpT + XIpI + XPpP + XKpK (0.143 × 0.883 + 0.429 × 0.967 + 0.143 × 1.003 + 0.143 × 0.936 + 0.143 × 1.23). The calculated reduced tD for this peptide is 0.993 corresponding to a drift bin value of 36.40. This is within 0.25% of the 36.31 value associated with the peak. This is significantly more accurate than the 36.65 value (0.94%) obtained from the polynomial fit to the tD versus molecular weight data. Supplementary Table 1 shows a comparison of experimental and theoretical tD values for all peptides used in this study. On average, experimental and theoretical tDs agree to within ±1.8%.
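The arithmetic of the [NTTIPTK+2H]2+ example can be reproduced with a short Python sketch. The size parameters, model tD, and experimental tD are the values quoted in the text; the function name and dictionary layout are illustrative assumptions, and exact fractions (1/7, 3/7) are used in place of the rounded frequencies 0.143 and 0.429.

```python
from collections import Counter

# Size parameters quoted in the text for the residues of NTTIPTK
# (seven-residue, lysine-terminated group).
SIZE_PARAMETERS = {"N": 0.883, "T": 0.967, "I": 1.003, "P": 0.936, "K": 1.230}

def predict_reduced_td(sequence, size_parameters):
    """Equation 1: sum of (residue frequency x intrinsic size parameter)."""
    counts = Counter(sequence)
    length = len(sequence)
    return sum(counts[aa] / length * size_parameters[aa] for aa in counts)

sequence = "NTTIPTK"
model_td = 36.65         # bins, from the polynomial fit (quoted in the text)
experimental_td = 36.31  # bins, measured peak centre (quoted in the text)

reduced_pred = predict_reduced_td(sequence, SIZE_PARAMETERS)
predicted_td = reduced_pred * model_td  # convert back to drift bins

# Relative errors of the two approaches against the measured value.
size_error = abs(predicted_td - experimental_td) / experimental_td
poly_error = abs(model_td - experimental_td) / experimental_td

print(f"predicted reduced tD : {reduced_pred:.3f}")
print(f"predicted tD (bins)  : {predicted_td:.2f}")
print(f"size-parameter error : {size_error:.2%}")
print(f"polynomial-fit error : {poly_error:.2%}")
```

The size-parameter error comes out slightly above the 0.25% quoted in the text because the predicted tD is not rounded to 36.40 bins before the comparison.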
To better understand the efficacy of a size parameter prediction of the data, it is instructive to make comparisons to the polynomial fit for a group of peptide ions. Figure 2 shows the ratios of predicted and experimental tDs obtained for both the size parameter fit and the polynomial fit. These have been performed for arginine-terminated peptide ions of 12 amino acid residues in length using the size parameter values depicted in Figure 1A. In comparison, all predicted tD values are within 8% of experimental values using the polynomial fit. All predicted tDs are within 5% of experimental values using the size parameters. Additionally, the data for the size parameter fit is more compressed around the unity line indicating a higher level of accuracy. This increased density of data points in this region is an indication of the tD prediction improvement obtained when using size parameters. Another way to visualize this improvement is to compare the number of ions in the parameterization group that are accurately predicted to within ±1%. The 1% accuracy threshold has been selected as being representative of the typical experimental accuracy of ion mobility measurements.43 Use of size parameters results in ~40% of all predictions meeting this accuracy threshold; the use of a polynomial fit to tD versus molecular weight data results in ~18% of all predictions reaching this same level of accuracy. Thus there is more than a 2-fold improvement in predictive capabilities using the size parameters compared to the polynomial fit. This advantage exists for higher accuracy thresholds as well. For example, an improvement of a factor of ~1.7 is observed for predictions that are within 2% of experimental values. Here we note that size parameters obtained from peptide ions of this size provide the most accurate predictions. That said, the average improvement for arginine-terminated peptides of all sizes using the 1% accuracy threshold is ~50%. 
For all comparisons reported here, a second-order polynomial fit has been used because it has been shown to provide greater prediction accuracy than either higher-order polynomials or a linear least squares fit.
Although the discussion has focused on the superiority of the size parameters in predicting tDs to within 1% and 2% of experimental values, it is worthwhile considering the range of accuracy over which this advantage holds. Consider Figure 3 which shows the average fraction of the peptides correctly predicted as a function of accuracy threshold. Again a comparison is drawn between the prediction capabilities of the size parameter fit and those of the polynomial fit to tD versus molecular weight data. The data shown in Figure 3 suggests that a significant advantage in predictive capabilities is attainable using intrinsic size parameters over an accuracy threshold range of ±0.5% to ±6%. At higher accuracy threshold values, both models do nearly as well in predicting tD values.
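The quantity plotted against accuracy threshold, the fraction of peptide ions whose predicted tD falls within a given relative error of the experimental value, can be computed as in the following sketch. The function name is hypothetical and the four tD pairs are invented solely to show the calculation; they are not data from this study.

```python
def fraction_within(predicted, experimental, threshold):
    """Fraction of predictions with relative error <= threshold."""
    if len(predicted) != len(experimental) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(
        abs(p - e) / e <= threshold for p, e in zip(predicted, experimental)
    )
    return hits / len(predicted)

# Made-up illustration values (drift bins), NOT measurements from the study.
experimental = [36.31, 40.00, 42.50, 45.00]
predicted = [36.40, 40.90, 42.40, 47.00]

for threshold in (0.01, 0.03, 0.05):
    frac = fraction_within(predicted, experimental, threshold)
    print(f"within ±{threshold:.0%}: {frac:.2f}")
```

Evaluating this function over a range of thresholds for both the size-parameter predictions and the polynomial-fit predictions would yield two curves of the kind compared in Figure 3.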
To determine how intrinsic size parameters would aid peptide identification efforts, it is useful to consider two factors influencing the quality of the fit. This is accomplished by comparing the predictions obtained for specific peptide ions with those that would be obtained for nearly all peptide ion sequences at the same m/z values. Consider the peptide ion [QAYAVSEK+2H]2+ from the 60S ribosomal protein L4 A. Using the polynomial fit to the tD versus molecular weight data for the eight-residue peptides, a reduced tD for the peptide ion [QAYAVSEK+2H]2+ is determined to be 1.037. The predicted reduced tD obtained using the appropriate intrinsic size parameters is 0.995. Thus, the prediction accuracy is ~0.041 or ~4.1%. A sampling of the complete list of lysine-terminated peptide ions ranging in length from 7 to 10 amino acids and within 0.01 Da of the precursor ion mass (894.45 Da) yields ~7.13×10^5 separate sequences. Predicted tDs for all of these peptide sequences have been computed using the intrinsic size parameters obtained from the 7-, 8-, 9-, and 10-residue, lysine-terminated peptide ion groups. It is observed that ~4% of all isobaric sequences have predicted tDs that are within the prediction accuracy (±4.1%) of the experimental sequence. In a sense, this prediction accuracy for incorrect peptide ion assignments can be considered a false discovery rate and will be useful in formulating the peptide ion identification scoring scheme outlined below. Thus, for this peptide ion, the predicted reduced tD outperforms those obtained for ~96.0% of nearly all possible sequences at the same m/z.
From such an analysis of interfering sequences, one can determine the degree of overlap at different prediction accuracy thresholds. This is shown in Figure 4A. Here consider only the trace with the solid square symbols as this represents data for peptide sequences matching the mass (894.45 Da) of the peptide ion [QAYAVSEK+2H]2+. As the prediction accuracy threshold increases from 0.005 to 0.030 the fraction of total peptide ion sequences within the required threshold value for a match with the experimental value increases slowly from ~0.00 to ~0.02. Going from a prediction accuracy threshold of 0.030 to 0.040, the fraction of total sequences predicted accurately doubles to ~0.04. Above this value, the fraction of predicted sequences increases dramatically to 0.21, 0.63, and 0.88 at accuracy thresholds of 0.050, 0.060, and 0.070, respectively. Above an accuracy threshold of 0.070, the fraction of predicted sequences begins to level off, approaching a value of 1 and resembling a sigmoidal dependence. The data can be fitted with an expression for the sigmoidal curve intensity (I) according to,71
$$I = A + \frac{B - A}{1 + e^{(x_{0} - x)/w}} \tag{6}$$
where the variables A and B represent the minimum and maximum values of the sigmoidal curve (0 and 1 in this case), respectively, and x is the prediction accuracy threshold. The variables x0 and w represent the prediction accuracy threshold value at the inflection point and a width factor of the sigmoidal curve, respectively. Using a prediction accuracy threshold value of ~0.060 to represent x0 and a value of ~0.006 for w, the data for competitive assignments to the peptide ion [QAYAVSEK+2H]2+ can be fit as shown in Figure 4A.
The comparison of overlapping competitive peptide ion assignments can be performed for other assigned peptide ions from the proteome database. For example, Figure 4A also shows data for accurately predicted interfering sequences having the same masses as the peptide ions [EAYVPATK+2H]2+ and [LNLFLSTK+2H]2+ from the proteins suppressor protein STM1 and isocitrate dehydrogenase, respectively. The data for competitive assignments of the former peptide ion also reveals a sigmoidal dependence albeit x0 and w values are shifted to higher values (~0.120 and ~0.009, respectively). The curve obtained for the latter peptide ion reveals a pseudo-sigmoidal dependence where the x0 and w values are shifted to lower values (~0.007 and ~0.003, respectively). The reduced tDs for the peptide ions [EAYVPATK+2H]2+ and [LNLFLSTK+2H]2+ are 1.104 and 1.023. Thus it is observed that as the reduced tD increases, values for x0 and w providing the best fit to the data increase as well. This observation is somewhat intuitive as a histogram of reduced tDs at a given m/z value reveals a Gaussian distribution centered about 1.000. That is, the majority of the reduced tDs are close to unity. Therefore, higher prediction accuracy thresholds would be required to obtain matches between competitive ion assignments and experimental features exhibiting reduced tDs that are significantly removed from 1.000.
To obtain a mathematical expression for a simple scoring scheme it is possible to use the data presented in Figure 4A. Examination of this data shows the dependence of a false discovery rate on two factors. The first factor is the overall prediction accuracy and the second factor is the magnitude of the reduced tD of the experimental peak. As described above and demonstrated in Figure 4A, these two factors are correlated. One way to estimate potential false discovery rates for dataset features is to reconstruct sigmoidal curves (Figure 4A) for given reduced tD values. As a first approximation this can be accomplished by examining the dependence of w and x0 on reduced tD. In Figures 4B and 4C this dependence is depicted for w and x0, respectively. Here the dependence is derived as a function of the deviation of the reduced tD from unity (d). The deviation is the fractional difference of the reduced tD from that of the “model” peptide ion obtained from the polynomial fit to tD versus molecular weight data. For the peptide ions [QAYAVSEK+2H]2+, [EAYVPATK+2H]2+, and [LNLFLSTK+2H]2+ having reduced tDs of 1.037, 1.104, and 1.023, the deviation values are 0.037, 0.104, and 0.023, respectively. A linear least squares fit of the data in Figures 4B and 4C provides the dependence of the sigmoidal curve variables on d. For the w and x0 variables this dependence is 0.0803×d + 0.0013 and 1.1489×d − 0.0022, respectively.
With the prediction accuracy dependencies on reduced tD deviation established, it is possible to construct estimated false discovery rate curves that are specific for dataset features of given reduced tDs. This is accomplished by substituting the w and x0 dependencies as well as values for A and B into equation 6 yielding,
$$I = \frac{1}{1 + \exp\!\left(\dfrac{1.1489\,d - 0.0022 - x}{0.0803\,d + 0.0013}\right)} \tag{7}$$
It is instructive to consider the false discovery rate at the limits of high- and low-confidence matches to experimental reduced tDs. A low-confidence assignment would consist of a small reduced tD deviation and a large prediction accuracy threshold. Using values of d = 0 and x = 0.15 (a worst case scenario based on examination of database values), the exponential expression in equation 7 would approach zero and the fraction of competitive peptides predicted accurately becomes 1. A high-confidence assignment, where d = 0.15 and x = 0, would result in a fraction of accurately predicted competitive peptides approaching 0 as the exponential expression approaches 3.44×10^5.
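The two limiting cases can be checked numerically with a short Python sketch. The functional form used here, 1/(1 + exp((x0 − x)/w)) with A = 0 and B = 1, is an assumption inferred from the limiting values quoted in the text; the linear coefficients for x0 and w are those reported above, and the function names are hypothetical.

```python
import math

def sigmoid_parameters(d):
    """Linear dependence of x0 and w on the reduced-tD deviation d."""
    x0 = 1.1489 * d - 0.0022
    w = 0.0803 * d + 0.0013
    return x0, w

def exponent(x, d):
    """Argument of the exponential: (x0 - x) / w (assumed form)."""
    x0, w = sigmoid_parameters(d)
    return (x0 - x) / w

def competitive_fraction(x, d):
    """Estimated fraction of competing sequences predicted within threshold x."""
    return 1.0 / (1.0 + math.exp(exponent(x, d)))

# Low-confidence limit: no deviation from the model ion, loose threshold.
low = exponent(0.15, 0.0)
# High-confidence limit: large deviation, perfect prediction.
high = exponent(0.0, 0.15)

print(f"low-confidence exponent : {low:.2f}")
print(f"high-confidence exponent: {high:.2f}")
print(f"exp(high)               : {math.exp(high):.3g}")
print(f"fraction (low)          : {competitive_fraction(0.15, 0.0):.6f}")
print(f"fraction (high)         : {competitive_fraction(0.0, 0.15):.2e}")
```

Under these assumptions the exponent spans roughly −117.08 to 12.75, and the exponential at the high-confidence limit is close to 3.44×10^5, in line with the values given in the text.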
A simple scoring scheme for aiding peptide ion identification can be devised based on equation 7. Because the power in the exponential expression in equation 7 essentially determines the false discovery rate, this expression can be used to provide a score (S) for potential sequence matches. For example, the power expression ranges from −117.07 to 12.74 for low- and high-confidence matches, respectively. A scoring scheme can be set up of the form
$$S = L\left(k + \frac{1.1489\,d - 0.0022 - x}{0.0803\,d + 0.0013}\right) \tag{8}$$
Here, k is an arbitrary variable used to shift the scoring range onto a positive scale. L is an arbitrary variable used to scale the output score. Values of 117.08 and 0.7703 for k and L, respectively, provide output scores that range from 0 to 100 for nearly all peptide sequences.
To evaluate the new scoring approach, consider the peptide ion [VSGVSLLALWK+2H]2+ from the 40S ribosomal protein S23 which has a reduced tD of 1.047. The predicted reduced tD for this peptide ion is 1.036. Using x = 0.011 and d = 0.047, S is determined to be 96.38. In the yeast proteome database used to derive the intrinsic size parameters (both arginine- and lysine-terminated peptide ions), there are 7 different peptide ions that are within ~1 Da of the molecular weight (1171.702 Da) of the peptide ion [VSGVSLLALWK+2H]2+. None have higher scores than the correct peptide; scores for these sequences range from 90.15 to 95.80. Here we note that caution should be used with such a scoring scheme especially when comparing values for species for which reduced tD deviations are significantly different. That said, the results shown above for a peptide ion exhibiting moderate prediction accuracy and reduced tD deviation are encouraging and suggest that in the future, a similar approach may be useful in helping to weed out false positive identifications by indicating more probable matches to experimental data.
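The score quoted for [VSGVSLLALWK+2H]2+ can be reproduced with the following Python sketch. The form S = L × (k + (x0 − x)/w) is an assumption inferred from the numerical values in the text (it returns 96.38 for x = 0.011 and d = 0.047, and spans approximately 0 to 100 between the low- and high-confidence limits); the function names are hypothetical, and the second candidate in the ranking example uses invented inputs for illustration only.

```python
def mobility_score(x, d, k=117.08, L=0.7703):
    """Score from prediction error x and reduced-tD deviation d.

    x : |predicted - experimental| reduced tD
    d : |experimental reduced tD - 1|
    k, L : shift and scale constants quoted in the text
    """
    x0 = 1.1489 * d - 0.0022  # inflection point as a function of d
    w = 0.0803 * d + 0.0013   # width factor as a function of d
    return L * (k + (x0 - x) / w)

def rank_candidates(candidates, experimental_reduced_td):
    """Sort (name, predicted reduced tD) pairs by descending score."""
    d = abs(experimental_reduced_td - 1.0)
    scored = [
        (name, mobility_score(abs(pred - experimental_reduced_td), d))
        for name, pred in candidates
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Worked example from the text: reduced tD 1.047, predicted 1.036.
score = mobility_score(x=0.011, d=0.047)
print(f"score: {score:.2f}")

# Ranking illustration; the second candidate is a made-up competitor.
ranking = rank_candidates(
    [("VSGVSLLALWK", 1.036), ("hypothetical competitor", 1.010)],
    experimental_reduced_td=1.047,
)
for name, value in ranking:
    print(f"{name}: {value:.2f}")
```

Because d is fixed by the experimental feature, candidates competing for the same feature are ranked purely by how closely their predicted reduced tD matches the measured one. Comparing scores across features with very different d values calls for the caution noted above.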
Additional comparisons of peptide ion scores are presented in Table 1. Here, the scores for 10 peptide sequences (selected at random) are listed. For half of the comparisons, the assigned peptide sequence yields the highest score when compared to other database peptide sequences within ~1 Da in mass. In two other instances the assigned peptide sequence yields the second highest score. In the remaining three instances, the assigned peptide score is the median score or higher. It is instructive to consider the cases where the assigned peptide ion did not score as highly as other sequences. For example, the peptide ion [IGTDIQDNK+2H]2+ yields the relatively high score of 96.31. However, it is the fourth highest score within a group containing nine total peptide sequences. Scores of the three higher-ranked peptide sequences range from 96.51 to 97.07. These values are very similar to that obtained for the assigned peptide ion. In this situation, several peptides that are within ~1 Da of the assigned peptide ion in mass have predicted tDs that are similar to the experimental tD. As such, the clustering of such high scores does not warrant discarding the assigned peptide ion sequence. Rather, additional evidence would be required to confirm the peptide ion assignment.
It is instructive to consider the peptide ion sequences that have a higher score rank than the sequence associated with the correct identification (Table 1). Three of the assigned peptide ions have scores yielding a rank of 3 (or lower). Two of these peptide ions have the highest mass fraction of polar residues compared with all other sequences in Table 1. The third peptide ion is one of the top five ions with respect to mass fraction of polar residues. Overall, these three peptide ions have a higher average mass fraction of polar residues (52.5±15.2%) compared to the other sequences (30.9±12.7%) in Table 1. Currently, no sequence correlation can be drawn between incorrect peptide ion assignments and the identified ions, presumably because of the limited number of comparisons available. Additionally, no correlation can be made to exact peptide composition. However, it is noted that the incorrect sequences of higher rank for all three peptides also contain a higher mass fraction of polar residues than those sequences of lower rank. Consider the peptide ion [IIENAEGSR+2H]2+, which has a mass fraction of polar residues of 46.4%. The two database peptide ions with scores of higher rank are [AQELAEATR+2H]2+ and [VLQDSGLEK+2H]2+. These peptide ions have mass fractions of polar residues of 49.3% and 59.4%, respectively. The average mass fraction of polar residues for the other scored peptide ions is 34.4±14.8%. This weak correlation suggests that the fraction of polar residues in peptide ion sequences can influence the scoring capability of the approach. That said, because of the limited amount of data, only a note of caution can be offered regarding the scoring of such peptides. A fuller elucidation of the effect of peptide ion sequence and composition on overall ion scores (and size parameters) requires the development of much larger proteome databases.
Several factors need to be addressed in order to improve the ability to aid peptide ion identification with ion mobility data. These include improvements in ion mobility instrumentation as well as in the method employed to determine intrinsic size parameters for different amino acids. As mentioned above, the development of instrumentation of higher resolving power would provide greater accuracy in the determination of ion mobilities and, by association, increased accuracy of intrinsic size parameters for different amino acids. In a related manner, higher resolving power may also allow the removal of interfering species affecting the mobility determination of peaks in proteomics mixtures. It is noted that a newer version of the Synapt HDMS system has recently been commercialized, affording ~3 times greater resolving power. Additionally, T-Wave separation parameters should be studied carefully. It is possible that many high-mobility species are travelling at the velocity of the voltage wave and are thus not separated as efficiently as other species.
The determination of intrinsic size parameters may also benefit from instrumentation developments in a different manner. For example, higher resolving power may allow the resolution of peptide ion conformer types (e.g., helices, partial helices, globules, and elongated structures). The resolution of structural types should allow increased parameterization of peptide ion subgroups. This would require the determination of correlations between peptide ion composition and/or sequence and conformer types. It may also be necessary to employ other methods such as molecular dynamics simulations to assign structural types. Another factor that should aid the determination of conformer types is the construction of much larger databases; many more sequence measurements would be required. As a note of caution, with larger databases comes the problem of increasing numbers of false positives. This is particularly problematic for data to be used in the determination of intrinsic size parameters. It is noted that a weighting factor can be incorporated into equation 3 (ref. 69). Such a weighting factor may include the probability score obtained from protein database searches.
It is instructive to consider the relevance of using intrinsic amino acid size parameters to validate peptide ion assignments. In a recent publication, Zubarev and Gorshkov and their coworkers described how they addressed a basic tenet of both analytical and engineering sciences, the tenet being a requirement for “…the use of a technique for a model validation materially different (complementary) from the one employed in the model creation”.72 For peptide ion identification in proteomics analyses, a verification model that is not based on fragment ion interpretation is required. Employing retention-time modeling algorithms, the authors found many peptide sequences, even those with high scores, exhibiting significant deviations from the theoretical retention times. The observation was largely attributed to the effects of chimeric spectra as opposed to bias in the independent retention-time models. Similarly, the present work illustrates how mobility prediction combined with a statistical scoring strategy can provide increased specificity of database search results. Therefore, the use of accurate mass, retention time, and ion mobility together as independent metrics of peptide validation should significantly reduce the number of false positives in complex mixture analysis.
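The idea of using accurate mass, retention time, and ion mobility as independent checks can be sketched as a filter that reports which metrics disagree with a proposed assignment. This is an illustration only: the record layout, the function name, and all three tolerances are hypothetical, and the example error values are invented to show the behavior (apart from the reduced tD difference of 0.011 from the earlier example).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MatchEvidence:
    mass_error_ppm: float  # measured minus theoretical mass, in ppm
    rt_error_min: float    # measured minus predicted retention time, in minutes
    td_error: float        # experimental minus predicted reduced tD


# Hypothetical tolerances, chosen only for illustration.
TOLERANCES = {"mass_error_ppm": 10.0, "rt_error_min": 2.0, "td_error": 0.05}


def failed_metrics(evidence, tolerances=TOLERANCES):
    """Names of the metrics whose absolute error exceeds its tolerance."""
    return [name for name, limit in tolerances.items()
            if abs(getattr(evidence, name)) > limit]


consistent = MatchEvidence(mass_error_ppm=3.0, rt_error_min=0.4, td_error=0.011)
suspect = MatchEvidence(mass_error_ppm=2.0, rt_error_min=6.5, td_error=0.010)

print(failed_metrics(consistent))  # nothing out of tolerance
print(failed_metrics(suspect))     # retention time disagrees with the assignment
```

Returning the list of failing metrics, rather than a single pass/fail flag, keeps each metric's verdict visible, which matters when the checks are meant to be independent of one another.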
Intrinsic size parameters for amino acid residues have been determined for a variety of peptide groups obtained from a yeast proteome database. In general, the size parameters are very similar to those obtained from singly-charged, lysine-terminated peptide ions, indicating a degree of similarity between the types of structures (or elements of structure) formed by singly- and doubly-charged peptide ions. Additionally, the size parameters are very similar for peptides of very different lengths (from 7 to 15 residues). These size parameters have been used to predict ion mobilities (tDs). Predictions of tDs using intrinsic size parameters are more accurate than predictions obtained from polynomial fits to tD versus molecular weight data. This improved predictive accuracy is proposed as a means to aid peptide ion identification, and a simple scoring scheme has been introduced.
The authors acknowledge support for the development of new instrumentation by a grant from the National Institutes of Health (1RC1GM090797-01). We are also grateful to Professor Rob Beynon from the University of Liverpool and Professor Chris Grant from the University of Manchester for providing the yeast samples. Finally, we are grateful to Tim Riley of the Waters Corporation for coordinating the sharing of data and helpful discussions.