Identifying protein post-translational modifications (PTMs) from tandem mass spectrometry data of complex proteome mixtures is a highly challenging task. Here we present a new strategy, named iterative search for identifying PTMs (ISPTM), for tackling this challenge. The ISPTM approach consists of a basic search with no variable modification, followed by iterative searches of many PTMs using a small number of them (usually two) in each search. The performance of the ISPTM approach was evaluated on mixtures of 70 synthetic peptides with known modifications, on an 18-protein standard mixture with unknown modifications and on real, complex biological samples of mouse nuclear matrix proteins with unknown modifications. ISPTM revealed that many chemical PTMs were introduced by urea and iodoacetamide during sample preparation and many biological PTMs, including dimethylation of arginine and lysine, were significantly activated by Adriamycin treatment in NM associated proteins. ISPTM increased the MS/MS spectral identification rate substantially, displayed significantly better sensitivity for systematic PTM identification than the conventional all-in-one search approach and offered PTM identification results that were complementary to InsPecT and MODa, both of which are established PTM identification algorithms. In summary, ISPTM is a new and powerful tool for unbiased identification of many different PTMs with high confidence from complex proteome mixtures.
Post-translational modifications (PTMs) of proteins play an extensive and pivotal role in eukaryotic signal transduction, gene regulation, and metabolic control in cells.1, 2 PTMs determine protein conformation, activity, and localization, as well as stability.1 Abnormal PTMs are often a cause or consequence of many pathological and disease conditions.3 Although they are important, system-wide identification of PTMs remains a highly challenging task for many reasons. First, PTMs display enormous diversity and complexity.4 There are more than 300 PTMs that are known to occur physiologically. Vertebrate proteins often undergo multiple PTMs at the same time. It was estimated that for human proteins there are 8~12 modified versions for each unmodified tryptic peptide.5 Second, PTMs generate complex fragmentation patterns in tandem mass spectrometry. This complexity poses a significant challenge for subsequent data analysis. Third, PTMs are usually present at low stoichiometry and low-abundance. Fourth, global proteomic studies are often limited to a specific PTM due to the prerequisite of effective enrichment strategies that employ specific PTMs.2 An unbiased approach for system-wide identification of many different PTMs in complex proteome mixtures is highly desirable.
Currently, liquid chromatography coupled with tandem mass spectrometry (LC-MS/MS) is the central method for identifying proteins with PTMs.1, 6 A particular advantage of this technique is that the MS/MS spectra contain information both on the intact full length peptide and on the masses of fragment ions from which amino acid sequences with specific sites of PTM can be derived. Typically, a modified peptide is identified through a process of peptide-spectrum-match (PSM) using programs such as SEQUEST7, Mascot8, or OMSSA9 to compare the observed spectral data to a protein database. Identification by these algorithms is based on a restricted database search in which MS/MS spectra are aligned with protein sequences bearing a few specified PTMs attached to specific amino acids. These approaches are not very effective at identifying large numbers of PTMs from complex proteome mixtures because the database search space expands exponentially as the number of PTMs increases. This increases the search time and false positive rate. For these reasons, it is generally advisable to include a limited number of variable modifications during database searches using conventional database search engines such as SEQUEST and Mascot.
To overcome the drawbacks of conventional database search methods, a number of strategies for unrestricted PTM identification have been developed, such as the de novo sequencing approach, sequence-tag approach, and spectral matching approach.10 Each approach has its own strengths and weaknesses. For example, blind search methods, such as InsPecT11, can identify all possible PTMs at once, especially unknown and unexpected PTMs. However, InsPecT is sensitive to the size of the protein database, and a double pass strategy is recommended to increase the specificity when the database exceeds ten million amino acids.12 The double pass strategy identifies proteins in the sample using unmodified peptides (or minimally modified peptides) in a first pass, and then reduces the database to include only those identified proteins and searches it for a wide selection of modified peptides in a second pass. Recently, Na et al. developed a novel blind search tool named MODa, which can perform fast and unrestrictive searches against large-scale databases of the human proteome.13 Using a dynamic programming method, MODa removed the limit on the number of unrestrictive PTMs that can be allowed in each peptide. However, MODa was not designed to address the accurate localization of modifications to specific sites in the identified peptide sequences.13 Other approaches, such as ModifiComb and DeltAMT, can quickly identify both known and unknown PTMs from complex mixtures.10,14 However, because these spectrum match algorithms are based on the similarity of mass shifts and retention times between the unmodified form and its modified counterpart, they are insensitive to the quality of MS/MS spectra. Thus they may not accurately localize the modification site for a PTM that may occur on different amino acid residues in the same peptide.
Here we report a novel strategy, named ISPTM (iterative search for peptide identification with PTMs), for the systematic identification of PTMs with site-specific confidence from complex MS data. The iterative search strategy concept has been applied to some conventional search engines (such as X!Tandem, Mascot and SEQUEST) by early developers, but with the double pass strategy.14, 15 However, our ISPTM approach differs from these iterative search approaches by refining the MS/MS spectra instead of refining the database. ISPTM enables the identification of PTMs from complex peptide mixtures without prior identification of the proteins in the sample. The performance of ISPTM was evaluated using three datasets with different levels of protein complexity. Our results indicated that the ISPTM approach substantially increased the MS/MS spectral identification rate, demonstrated significantly better sensitivity for global PTM identification than the conventional all-in-one search approach, and provided PTM identification results complementary to those from InsPecT and MODa. Using ISPTM, we found that many biologically meaningful PTMs, as well as some chemical modifications, occurred on the nuclear matrix (NM) proteins of mouse pro-B 2A cells after ADR-induced DNA damage.
Two synthetic modified peptide mixtures were received from the Proteomics Standards Research Group at the Association of Biomolecular Resource Facilities. One (sample #1) contained a lyophilized mixture of 70 synthetic modified peptides, and the other (sample #2) contained the same mixture combined with a tryptic digest of six proteins from which the synthetic peptides were derived. More details of these samples are available at www.abrf.org/sprg, survey project 2011. Peptide samples were analyzed using an Easy nanoLC, equipped with a 75 μm × 10 cm Magic C18 AQ LC column, coupled to a Q-Exactive mass spectrometer (Thermo Scientific, San Jose, CA) as previously described.16
Datasets for an 18-protein standard mixture were downloaded from the Institute for Systems Biology (ISB) website.17 These datasets were analyzed by InsPecT and MODa. Briefly, the mgf files were searched against the ISB database of 18 standard proteins plus 92 contaminant proteins and 1709 Haemophilus influenzae RD proteins as background (obtained from ISB website). Carbamidomethylation of cysteine was used as the fixed modification. Only one modification was allowed in each peptide. In InsPecT, all the identified spectra were collected by applying a filter of p-value less than 0.05. An FDR cutoff of 0.01 was applied for the filtered spectra based on the F-score. In MODa, a probability score > 0.95 was applied, and an FDR cutoff of 0.01 was used to filter the spectra by the PSM score. In ISPTM, the OMSSA outputs were collected and filtered by p-value < 0.05, and an FDR cutoff was applied based on the OMSSA E-value.
Abelson virus-transformed mouse pro-B cell line 2A was maintained in RPMI media supplemented with 10% fetal bovine serum and 12.5 μM β-mercaptoethanol (Invitrogen, Carlsbad, CA). Pro-B 2A cells were treated with either 1 μM Adriamycin (Sigma-Aldrich, St Louis, MO, dissolved in DMSO) or DMSO alone for 4 hours. All buffers for NM sample preparation contained 1% protein phosphatase inhibitor cocktail 1, 1% protein phosphatase inhibitor cocktail 2, 1% protease inhibitor cocktail, and 1.2 nM phenyl methane sulfonyl fluoride (all from Sigma). Cell lysates were re-suspended in CSK buffer (10 mM PIPES pH 6.8, 100 mM NaCl, 300 mM sucrose, 3 mM MgCl2, 1 mM EGTA, 0.5% Triton X-100, 1 mM dithiothreitol (DTT), all from Sigma) and incubated on ice for 5 min. After centrifugation at 9,000 rpm for 1 min, the pellet was re-suspended in low salt extraction buffer (42.5 mM Tris HCl pH 8.3, 8.5 mM NaCl, 2.6 mM MgCl2, 1% Tween 40, 0.5% deoxycholic acid) and incubated on ice for 5 min, followed by centrifugation at 9,000 rpm for 1 min. The pellet was then re-suspended in digestion buffer (10 mM PIPES pH 6.8, 50 mM NaCl, 300 mM sucrose, 3 mM MgCl2, 1 mM EGTA, DNase (500 u/mL, Roche, Indianapolis, IN), RNase (500 u/mL, Ambion, Austin, TX), 0.5% Triton X-100, and 1 mM DTT) and incubated at room temperature for 1 hour with gentle shaking. The concentration of ammonium sulfate was adjusted to 0.25 M and the preparation was incubated for a 10 min extraction period at room temperature. After centrifugation at 13,000 rpm for 1 min, the pellet was washed with 2 M NaCl buffer (2 M NaCl, 10 mM PIPES, pH 6.8, 10 mM EDTA). After a final centrifugation at 13,000 rpm for 1 min, the pellet was re-suspended in lysis buffer (8 M urea, 50 mM ammonium bicarbonate) and incubated at 37°C overnight to dissolve the nuclear matrix-associated proteins. All centrifugations were performed at 4°C.
Methods for sample preparation, trypsin digestion, strong cation exchange (SCX) chromatography, and LC-MS/MS analysis of the NM protein digests were described previously.18 Briefly, two NM protein samples were reduced with DTT and carboxyamidomethylated with iodoacetamide (IAA) at room temperature. Tryptic digestion was performed and the resulting peptides were desalted by solid phase extraction. SCX separation was performed and twenty fractions were obtained from each NM sample. These fractions were analyzed by LC-MS/MS on a nanoLC coupled with LTQ-Orbitrap-XL mass spectrometer (Thermo Scientific).
For the synthetic peptide data, we used a protein database containing the six proteins that the modified peptides belong to, plus four contaminant proteins, 1,990 background proteins from both the human and bovine proteomes, and the reverse sequences for all of these 2,000 proteins (obtained from the sPRG 2011 survey project). For the ISB data, the same database was used. For the NM data, MS/MS spectra were searched against i) a concatenated database containing 55,303 proteins from the international protein index (IPI) mouse database (version 3.52), ii) the commonly observed 262 contaminants (forward database), and iii) the reversed sequences of all 55,565 proteins from i and ii (reverse database). The OMSSA engine (v2.1.9, Linux version) was used for the database searches. The initial mass deviation tolerance of the precursor ion was set to 0.02 Da and the fragment ion tolerance was set to 0.5 Da for the NM data. The initial mass deviation tolerance was 0.02 Da and the fragment ion tolerance was 0.05 Da for the synthetic peptide mixture data. A maximum of 2 missed cleavages was allowed in peptide identification. We also employed a multi-blind search with the MODa software to analyze the NM data, using the same settings as ISPTM. Both ISPTM and MODa searches were performed using the computing resources available at the University of Nebraska Holland Computing Center (HCC).
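The reversed sequences in these databases make a target-decoy estimate of the FDR possible. The following is a minimal Python sketch of that idea, not the scripts used in this work. It assumes the FDR at a given E-value threshold is estimated as the number of decoy (reverse) hits divided by the number of target (forward) hits, which is one common convention; the estimator actually used is not stated here. The `Psm` record and the function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Psm:
    """One peptide-spectrum match (hypothetical record layout)."""
    spectrum_id: str
    e_value: float
    is_decoy: bool  # True when the match is to a reversed sequence


def evalue_cutoff_for_fdr(psms, max_fdr=0.01):
    """Return the largest E-value threshold whose estimated FDR <= max_fdr.

    Assumption: FDR is estimated as (#decoy hits) / (#target hits) among
    all PSMs with an E-value at or below the threshold.
    Returns None when no threshold satisfies the limit.
    """
    best = None
    targets = 0
    decoys = 0
    # Walk the PSMs from most to least significant.
    for psm in sorted(psms, key=lambda p: p.e_value):
        if psm.is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets > 0 and decoys / targets <= max_fdr:
            best = psm.e_value
    return best
```

Running the function separately on the unmodified and the modified PSM lists would mirror the separate FDR control of the two groups described in the Discussion.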
The ISPTM approach consists of four steps. Step 1: the MS/MS raw data were pre-processed by DeconMSn and DtaRefinery as previously described.18 MS/MS spectral data were then stored in mgf format ready for the OMSSA search. Step 2: a basic search was performed with carboxyamidomethylation of cysteine as a fixed modification (no variable modifications in this step). The identified unmodified peptides were filtered using an OMSSA E-value cutoff of < 0.01. Step 3: the identified MS/MS spectra from Step 2 were removed from the initial spectrum pool, and the remaining spectra were used for iterative searches. In this step, each cycle tested a small number of variable modifications (1, 2 or 3) until all combinations of the modifications were tested. For the synthetic peptide data, ISPTM searches were made in five variations: 1) testing 13 known modifications taken two-at-time (IS-13, 13×12/2 = 78 runs), 2) testing these 13 authentic modifications plus 13 false modifications (Supplemental Table 1) taken two-at-time (IS-26, 26×25/2 = 325 runs), 3) testing all (207) modifications in the OMSSA database, one-at-time (IS-Single, 207 runs), 4) testing all OMSSA modifications two-at-time (IS-Double, 207×206/2 = 21,321 runs), and 5) testing all OMSSA modifications three-at-time (IS-Triple, 207×206×205/(3×2) = 1,456,935 runs). For the ISB data, iterative searches were performed by the IS-Single strategy, testing 207 modifications one-at-time. For the NM data, we removed 46 modifications (chemical modifications using stable isotope labels, Supplemental Table 2) that cannot occur in our biological samples, and iteratively searched the rest of the modifications using the IS-Double strategy. The modified peptides that were identified were filtered using an OMSSA E-value cutoff of < 0.1. When multiple peptide sequences were identified from the same MS/MS spectrum, the peptide sequence with the smallest E-value was retained.
Step 4: all identification results were combined and exported with a fixed FDR, followed by calculation of site confidence score for each modification site.
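The run counts quoted in Step 3 are binomial coefficients, and the rule of keeping the smallest E-value per spectrum is a simple reduction over the pooled hits. The Python sketch below illustrates both under stated assumptions: the function names and the `(spectrum_id, peptide, e_value)` tuple layout are hypothetical and are not taken from the ISPTM scripts.

```python
from itertools import combinations
from math import comb


def modification_batches(mod_ids, per_search):
    """Yield every set of `per_search` modifications from the pool.

    Each yielded tuple would be the variable-modification list of one search.
    """
    return combinations(mod_ids, per_search)


def keep_best_hit(hits):
    """Keep, for each spectrum, the hit with the smallest E-value.

    `hits` is an iterable of (spectrum_id, peptide, e_value) tuples pooled
    from all iterative searches (hypothetical layout).
    """
    best = {}
    for spectrum_id, peptide, e_value in hits:
        current = best.get(spectrum_id)
        if current is None or e_value < current[1]:
            best[spectrum_id] = (peptide, e_value)
    return best


# Run counts for IS-13, IS-26, IS-Double and IS-Triple.
print(comb(13, 2), comb(26, 2), comb(207, 2), comb(207, 3))
# 78 325 21321 1456935
```

Because each batch is independent of the others, the searches can be distributed across cluster nodes and merged afterwards with `keep_best_hit`.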
To provide an empirical measure of confidence that a PTM site was correctly localized, a probability-based significance was calculated using the site-determining product ions. Briefly, a probability distribution P(X) is based on the hypothesis that random sampling of fragment ions in an MS/MS spectrum follows a binomial distribution:
$$P(X \ge k) = \sum_{x=k}^{N} \binom{N}{x}\, p^{x} (1-p)^{N-x}$$
where p is the probability of matching a fragment ion in a sampling event, and N and k represent the theoretical and observed site-determining fragment ions from the MS/MS spectrum.
For each modified peptide, the site confidence (SC) score for a PTM at position i is calculated as:
$$SC_i = 1 - \sum_{j \ne i} P_j$$
where Pj is the false positive (FP) probability that a PTM is located at position i but not at position j in the same peptide.
To calculate the SC score, the MS/MS spectra were preprocessed to create a list of observed fragment ions that contained the 6 most intense fragment ions per 100 m/z units. Masses of theoretical ions for each identified peptide were obtained from MS-Product (http://prospector.ucsf.edu/prospector/). For each identified peptide with a PTM at position i, the alternative forms of modifications include: 1) the same PTM at other possible sites, 2) a different PTM with similar mass shift (<0.02 Da, identical to the OMSSA tolerance of precursor ions) in this peptide. For each alternative PTM at position j, N (total number of site-determining ions), k (number of observed fragment ions that matched the theoretical ion using a mass tolerance of 0.5 Da) and p (=0.06) were used to calculate the Pj. Then the SC score was determined by 1 minus the sum of Pj of all alternative forms.
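The procedure above can be sketched in a few lines of Python. This is an illustration under assumptions, not the deposited ISPTM code: Pj is taken to be the cumulative binomial tail (the chance of matching at least k of N site-determining ions at random), peak picking uses fixed windows aligned at multiples of 100 m/z, and all function names are hypothetical. The handling of a sum of Pj greater than 1 is not described in the text, so the score is left unclamped.

```python
from collections import defaultdict
from math import comb


def top_peaks_per_window(peaks, window=100.0, top_n=6):
    """Keep the `top_n` most intense peaks in each m/z window.

    `peaks` is a list of (mz, intensity) pairs. Windows are assumed to be
    aligned at multiples of `window`.
    """
    bins = defaultdict(list)
    for mz, intensity in peaks:
        bins[int(mz // window)].append((mz, intensity))
    kept = []
    for members in bins.values():
        members.sort(key=lambda peak: peak[1], reverse=True)
        kept.extend(members[:top_n])
    return sorted(kept)


def binomial_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(k, n + 1))


def site_confidence(alternatives, p=0.06):
    """SC = 1 - sum of P_j over all alternative forms.

    `alternatives` holds one (N, k) pair per alternative form j, where N is
    the number of site-determining ions and k the number that were matched.
    """
    return 1.0 - sum(binomial_tail(n, k, p) for n, k in alternatives)
```

For example, `site_confidence([(4, 4)])` is very close to 1 because matching all four site-determining ions by chance has probability 0.06 to the fourth power, whereas an alternative with no matched ions contributes a Pj of 1 and drives the score to 0.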
The ISPTM workflow is summarized in Figure 1. Scripts written in Python were used to perform tasks including spectra refining, filtering the spectra of unmodified peptides, setting the pool of PTMs and the number of variable modifications in each iterative search, generating the commands for OMSSA searches, collecting and filtering the identification results, and annotating the site confidence of PTMs. The Python scripts are fully automatic in each step, minimizing user intervention. The outputs of ISPTM follow the same format as standard OMSSA csv outputs, with a new column indicating the SC score for each PTM site. The Python scripts and instructions have been deposited on Google Code (https://code.google.com/p/isptm-python/). The ISPTM analyses of the synthetic peptides and the NM data were both finished in less than 48 hrs.
In the synthetic modified peptide mixtures, peptide “NGDTASPKEYTAGR” with 3 different methylated forms of lysine (methylation, dimethylation and trimethylation) were identified in a single conventional search allowing all 13 modifications as variable modifications. In ISPTM, an iterative database search was applied and matches were found when the mono-, di- and tri-methyl modifications were tested, respectively (Supplemental Figure 1). We evaluated the performance of ISPTM using the synthetic modified peptide mixtures and compared it to the conventional all-in-one search. In total, 41 peptides were identified from 278 spectra by the conventional search of sample #1, while 45 peptides were identified from 358 spectra by the ISPTM method using the 13 modifications taken two-at-time (IS-13). Using a false discovery rate (FDR) < 0.1, only 13 peptides (out of 109 spectra) from the conventional search were acceptable, while 32 peptides (out of 239 spectra) were acceptable from the ISPTM method. A detailed comparison of conventional and ISPTM search results with different strategies for analysis of the synthetic peptide data is displayed in Supplemental Table 3.
The overall performance of the conventional search and ISPTM approaches with multiple strategies for sample #1 was compared using the receiver operating characteristic (ROC) plot (Figure 2A). The ROC plot demonstrated that an iterative search using the IS-13 strategy achieved the best discriminating power between the authentic and false positive identifications. The discriminating power was essentially the same if another 13 false modifications were included in the ISPTM search (IS-26). We further tested the performance of ISPTM on the synthetic peptides by using the 207 modifications in the OMSSA modifications pool in the search. In these analyses, variable modifications were used one-at-time (IS-Single), two-at-time (IS-Double) or three-at-time (IS-Triple). ROC analysis indicated that the IS-13 and IS-26 strategies had higher discriminating power than the IS-Single, IS-Double, and IS-Triple strategies. Interestingly, the performances of all these ISPTM strategies were essentially the same when ROC analysis was applied to identification results from sample #2 (Figure 2B). Overall, all the iterative searches showed superior discriminating power compared to the conventional all-in-one approach. The discriminating power of the IS-Single, IS-Double, and IS-Triple strategies increased for sample #2, compared to sample #1, mainly because the peptides in sample #2 are more complex than in sample #1 and many natural modifications, such as oxidation of methionine and acetylation of the protein N-terminus, are present.
We also noticed that the PSM score may change when different numbers of variable modifications were used. As shown in Figure 2C, for the same spectra with same identification results, the PSM score [represented by the −log10(E-value)] was plotted for the conventional and the IS-Double search results. The regression line with a slope ≈ 1 and an intercept of 1.29 indicates that the PSM score for the IS-Double search is slightly higher than the PSM score for the all-in-one search. This is the reason why the discriminating power decreased in the conventional all-in-one search. However, there is almost no difference in the PSM score when the IS-Double and IS-Triple search results were compared (slope = 1, intercept ≈ 0, Figure 2D).
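As a rough conversion, and assuming the slope is exactly 1 with the IS-Double score on the vertical axis, the intercept in Figure 2C can be read on the E-value scale as follows:

```latex
-\log_{10} E_{\text{IS-Double}} = -\log_{10} E_{\text{all-in-one}} + 1.29
\quad\Longrightarrow\quad
\frac{E_{\text{all-in-one}}}{E_{\text{IS-Double}}} = 10^{1.29} \approx 19.5
```

Under that assumption, the same peptide-spectrum match receives an E-value roughly 19.5 times smaller in the IS-Double search than in the all-in-one search.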
We employed InsPecT, MODa, and our ISPTM approach to analyze the ISB data. At FDR < 0.01, InsPecT, MODa and ISPTM identified a total of 5,790, 11,233 and 9,556 spectra, respectively. Among these spectra, 1,639, 3,012 and 2,442 were identified as modified peptides by InsPecT, MODa and ISPTM, respectively. All peptide/protein identifications are listed in Supplemental Table 4, and a PTM frequency matrix11 was developed for the modified spectra for each program. A Venn diagram shows the different coverage of identified peptides, as well as the modified peptides (peptides with identical sequence, modification site, and mass shift), by these approaches (Figure 3A and 3B). Overall, InsPecT, MODa and ISPTM provide complementary identification coverage for the ISB dataset. MODa identified more peptides than InsPecT and ISPTM. As shown in Table 1, the most frequent modifications identified by both InsPecT and MODa were the sodium and potassium adducts. Other frequent PTMs identified by both programs include oxidation of methionine, carbamidomethylation of cysteine, and several biologically relevant PTMs, such as dehydration by beta-elimination of serine and threonine, and peptide N-terminal acetylation. In the ISPTM analysis, the most frequent modification was deamidation of asparagine. Consistent with the InsPecT and MODa results, oxidation of methionine, acetylation of the protein N-terminus, and beta-elimination of serine and threonine were identified. Interestingly, ISPTM also identified many peptides with cyclization of the N-terminal S-carbamidomethyl cysteine (Pyro-CamC) and N-terminal pyro-glutamic acid (Pyro-Glu).
We evaluated the ISPTM approach using a pair of complex biological samples: nuclear matrix protein digests of mouse Pro-B cells before (Control) and after DNA damage (ADR-treated). Using the basic search for unmodified peptides, with E-value < 0.01, 9,315 and 5,527 MS/MS spectra were identified from the Control and ADR-treated samples, respectively. The numbers of false positive peptides, identified at the spectrum level, from the Control and ADR-treated datasets were 8 and 51 (both FDR < 0.01), respectively. Using the ISPTM approach with two variable modifications at a time, 28,595 modified peptide spectra were identified at E-value < 0.1. By applying the FDR cutoff of 0.01, we identified 1,921 unique peptide sequences from 5,068 MS/MS spectra, of which 1,700 spectra were from the Control samples and 3,368 spectra were from ADR-treated samples. In the ISPTM results, 32.5% (625/1921) of the modified peptides were identified with an unmodified form in the basic search. At the protein level, 62.1% (907/1460) of the modified proteins were identified in the basic search. MODa was used to analyze the NM data as well. At FDR < 0.01, MODa identified 5,492 and 3,102 spectra of unmodified peptides as well as 1,857 and 4,437 spectra of modified peptides from the Control and ADR-treated samples, respectively. For the combined Control and ADR-treated nuclear matrix samples, ISPTM identified more unique peptides and proteins than MODa (Figure 3C and 3D). However, MODa identified more modified peptides with either identical sequences or identical modifications than ISPTM (Figure 3E and 3F).
The frequencies of identification for modified peptides obtained by the ISPTM approach are shown in Supplemental Table 5. First, we applied an SC cutoff of 0.8 to remove low-confidence modifications. Then the results were manually curated. For example, we found that OMSSA assigned ubiquitination, methylation and sumoylation to lysines and arginines at the C-terminus of some peptides. However, a modified lysine or arginine is generally not recognized by trypsin for digestion and therefore could not appear at the C-terminus of a tryptic peptide. We also identified some spectra with O-GlcNAcylation of serine and threonine. Because O-GlcNAc is readily lost as an oxonium ion during collision-induced dissociation,19 identification of O-GlcNAc modified peptides is very difficult. Thus such assignments were removed from the identification results. As a result, we obtained 4,166 spectra identified for a total of 1,636 unique peptides (an entire list of identified peptides with PTMs is presented in Supplemental Table 6). The three most frequent modifications were carbamylation of lysine, carbamylation of the peptide N-terminus, and acetylation of the protein N-terminus. For example, a peptide with carbamylation at lysine is shown in Supplemental Figure 2A. This modification is presumably induced by urea in the sample lysis buffer.20 Another reagent commonly used during proteomics sample preparation is iodoacetamide (IAA). It is used to alkylate cysteines exposed by reduction of disulfide bonds. We found that histidine, glutamic acid, aspartic acid, and lysine were also alkylated by IAA.21 A peptide with carboxyamidomethylation at histidine is shown in Supplemental Figure 2B.
The Pyro-Glu modification appeared in 128 spectra (Supplemental Figure 2C); N-terminal glutamine or glutamic acid residues are known to form Pyro-Glu under aqueous conditions.22 Another frequent chemical modification was Pyro-CamC at N-terminal cysteines, as shown in Supplemental Figure 2D. This modification has been reported to be caused by enzymatic digestion of proteins that have been S-alkylated by IAA.23
Deamidation of asparagine and glutamine was identified in 247 and 34 spectra in the NM samples (Supplemental Figures 3A and 3B). It has been reported that deamidation is involved in the “DNA damage-induced cellular response”.24 Moreover, this modification can be caused chemically, especially during sample storage at higher temperature or when the asparagine or glutamine is followed by glycine.25 Two additional PTMs, oxidation of methionine and acetylation of the protein N-terminal, which are often included in routine database searches, were identified in 183 and 249 spectra, respectively (Supplemental Figures 3C and 3D). ISPTM results indicated that the chemical modifications are quite abundant in the proteomics samples. It is important that they are included in routine database searches because their presence may affect the identification of other modifications. For instance, Figure 4A shows an annotated MS/MS spectrum of peptide “GVLKVFLENVIR” derived from histone H4 position 57~68. Previous studies have reported that the lysine residue at position 60 (H4K60) can be acetylated26, ubiquitinylated27, or formylated28 physiologically. In our study, all three forms of modification on H4K60 were identified, but all peptides were also carbamylated on the N-terminal glycine (Figure 4B – 4D). Using the conventional database search approach, these spectra would not likely have been identified because carbamylation of the N-terminal glycine is not a common modification and therefore is not normally included as a variable modification.
The numbers of identified unique peptides corresponding to selected PTMs from the Control and ADR-treated NM samples are compared in Figure 5A. The number of peptides with deamidation of asparagine, which is the most frequent PTM, is similar in the two samples. The number of peptides with oxidized methionine decreased about 50% after ADR treatment. Oxidative stress may cause oxidation of methionine in vivo.29 However, this modification may also occur in vitro during sample preparation. The number of peptides with dimethylated arginine increased in the ADR-treated sample. It has been reported that dimethylation of ribonucleoproteins (RNP) increases their ability to bind DNA and to promote gene transcription.30 Table 2 lists a number of peptides that were found to carry dimethylated arginine, and their corresponding proteins. Dimethylated arginine has been reported previously for some of these proteins: Hnrnpa0, Hnrnpa1, Pabpc1, Ewsr1, Snrpb, Hnrnph1, and Hnrnpu.30 A representative peptide with dimethylated arginine at Hnrnpa0 is shown in Supplemental Figure 4A. Neutral loss of a monomethylamine (H2N-CH3) group indicated that this is a symmetric dimethylation site.31 Other RNP proteins (Hnrnpa1) and Pabpc1 (Supplemental Figure 4B) were identified with dimethylation, indicating that this modification is essential for normal mRNA metabolism. Interestingly, dimethylation of arginine at Ewsr1 (Supplemental Figure 4C), Snrpb, Rbm33 (Supplemental Figure 4D), Hnrnph1, and Hnrnpu was only observed after ADR treatment. The function of dimethylation on these proteins remains unknown.
A glycyl (GG) modification on lysine is a degradation signal for ubiquitination.32 Figure 5B shows the annotated MS/MS spectrum of “LIFAGKQLEDGR”, which is a signature tryptic peptide of the protein with K48 poly-ubiquitination.33 Interestingly, the ubiquitination modification site on histone H4 (H4K60) was found to be the site of formylation after ADR treatment. As a secondary modification that results from oxidative DNA damage, formylation of lysine in histone proteins may interfere with the signaling functions and thus contribute to the pathophysiology of oxidative and nitrosative stress.34 All the above data indicate that PTMs on many NM proteins, especially the core histones, were altered by treatment with ADR. Such modifications may change the activity and function of these proteins in response to ADR-induced DNA damage.
PTMs are extremely important for maintaining protein structure and function. We present in this paper a novel strategy named ISPTM to identify the proteins with complex patterns of PTMs from LC-MS/MS data. Compared to the conventional all-in-one search strategy, ISPTM can effectively control the search space by including a very limited number of PTMs in a search and has higher discriminating power for the true PTMs as we demonstrated in Figure 2A and 2B. In contrast, when a large number of PTMs needed to be tested in a sample, it will be increasingly difficult to use the conventional all-in-one search because of the exponential increase of search space and reduced PSM score to discriminate the true PTM identifications from the false identifications (Figure 2C). The unique feature of the ISPTM approach is that it performs an exhaustive search for hundreds of different modifications expected to be found in complex protein samples, including both naturally occurring and chemical modifications. Our data indicated that a large portion of peptides are chemically modified by carboxyamidomethylation, carbamylation and deamidation, but these chemical modifications are generally not considered in routine database searches. Importantly, our approach has demonstrated that identifying peptides with various (either chemical or biological) modifications in a sample can not only increase the spectral identification rate but also can increase the chance of identifying key protein regulators and their possible PTMs.
One limitation of ISPTM is that it is not designed to discover totally unknown modifications. All modified peptides are identified from a pre-defined pool of modifications. Nevertheless, the current UNIMOD database (www.unimod.org) contains more than a thousand modifications.35 This pool could be employed in place of the OMSSA pool, which currently contains 207 modifications. However, instead of testing all PTMs in this pool (as in a Mascot error tolerant search), users can choose a small subset of PTMs of interest to perform ISPTM. For a relatively simple mixture like the ISB data, we demonstrated that the performance of the ISPTM approach is equivalent to that of the blind search engines InsPecT and MODa when all possible modifications were tested.
When analyzing more complex proteome mixtures, like the NM dataset here, ISPTM identified about 72% more spectra of unmodified peptides (14842/8594) but about 19% fewer spectra of modified peptides (5068/6294) than MODa. We interpret these results as follows. First, a restrictive engine such as OMSSA is better than unrestrictive engines at identifying unmodified peptides. Second, ISPTM separates the modified and unmodified peptides and applies the FDR to each group individually. At an FDR of 0.01 in the NM data, the E-value cutoff was 0.01 for unmodified peptides but 2.8E-6 for modified peptides. This much stricter cutoff might be another reason that ISPTM identified fewer modified peptides than MODa. Third, both results contain false positive identifications even though an FDR cutoff has been applied. However, tying a modification to specific sites, as restrictive searches do, can minimize such artifacts. For instance, carbamylation (+43 Da) occurs only at the peptide N-terminus and at lysine, yet MODa assigned it to other sites in a number of spectra. Another advantage of the ISPTM approach over blind search or de novo methods is that the identification results are very easy to interpret, because all modifications are known and have clear site specificity. Finally, the large number of potential modifications provides both algorithms with wide latitude for making assignments. Consequently, even peptides with strong scores can prove to be assigned incorrectly. Thus we strongly suggest that modification results be confirmed by manual sequencing and by orthogonal approaches such as site-directed mutagenesis or MS/MS analysis of the synthetic peptide with a specific modification.36
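Applying the FDR separately to modified and unmodified peptides can be sketched as follows. The sketch assumes a target-decoy estimate (decoy hits divided by target hits), which this section does not specify, and a hypothetical PSM representation as `(evalue, is_decoy, is_modified)` tuples. It is an illustration, not the authors' script.

```python
from itertools import groupby


def evalue_cutoff(psms, fdr=0.01):
    """Largest E-value threshold at which decoys/targets <= fdr.

    psms: iterable of (evalue, is_decoy) pairs. Returns None if no
    threshold satisfies the FDR. PSMs with equal E-values are counted
    together so the returned cutoff never splits a tie.
    """
    targets = decoys = 0
    best = None
    ordered = sorted(psms, key=lambda p: p[0])
    for evalue, tied in groupby(ordered, key=lambda p: p[0]):
        for _, is_decoy in tied:
            if is_decoy:
                decoys += 1
            else:
                targets += 1
        if targets > 0 and decoys / targets <= fdr:
            best = evalue
    return best


def class_specific_cutoffs(psms, fdr=0.01):
    """Split PSMs into modified and unmodified and find a cutoff for each.

    psms: iterable of (evalue, is_decoy, is_modified) tuples.
    """
    by_class = {"unmodified": [], "modified": []}
    for evalue, is_decoy, is_modified in psms:
        key = "modified" if is_modified else "unmodified"
        by_class[key].append((evalue, is_decoy))
    return {name: evalue_cutoff(group, fdr) for name, group in by_class.items()}
```

Because decoy hits are typically more frequent among modified candidates, the modified class tends to need a stricter cutoff to reach the same FDR, which is the pattern reported above for the NM data.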
In this paper we introduced an SC score method to assess the site confidence of the identification results. Although both the SC score and the earlier A-score37 method are based on a cumulative binomial distribution model (measuring the likelihood of matching at least the observed number of site-determining ions by chance), we think that the SC score developed in this manuscript may have several advantages. First, the A-score is an ambiguity score that only distinguishes the top two candidate sites, whereas the SC score considers all possible candidate sites. Second, the A-score is restricted to the same modification at different sites (e.g., phosphorylation on S/T/Y). However, if many PTMs with identical or close mass shifts are involved in a search, as in the case of an ISPTM search, it is difficult to determine exactly which PTMs are occurring and to distinguish them. For instance, deamidation and citrullination have exactly the same mass shift, and the mass shifts of acetylation and tri-methylation are very close, differing by only 0.036 Da. Such a small difference is undetectable in low-resolution MS but detectable in high-resolution MS. Our scripts automatically calculate the mass shifts of all possible PTMs included in the search and the mass tolerance of all identified peptides, providing the SC score without user intervention. Third, because the SC score is a confidence score, users can apply a certain cutoff (e.g., 0.8) to filter out ambiguous PTM identification results. To summarize, the SC score developed here allows for more PTM assignments in a high-throughput fashion. However, the current design of ISPTM does not allow for assignment of a novel PTM unless that PTM is included in the ISPTM search.
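Two ingredients of this paragraph can be sketched in a few lines: the cumulative binomial tail shared by the A-score and the SC score, and a check of whether two modifications can be told apart by mass at a given tolerance. How the tail probability is converted into the SC score is not described in this section, so the sketch stops at the probability. The per-ion match probability `p`, the function names, and the rule that two shifts are separable only if they differ by more than twice the tolerance are illustrative assumptions. The mass shifts are standard monoisotopic values.

```python
from math import comb

# Monoisotopic mass shifts in Da.
MONO_SHIFT = {
    "acetylation": 42.010565,
    "trimethylation": 42.046950,
    "deamidation": 0.984016,
    "citrullination": 0.984016,
}


def binomial_tail(n: int, k: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p): the chance of matching at least
    k of n site-determining ions when each matches randomly with probability p."""
    if not 0 <= k <= n:
        raise ValueError("k must satisfy 0 <= k <= n")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def separable_by_mass(mod_a: str, mod_b: str, tol_da: float) -> bool:
    """True if the two shifts differ by more than twice the mass tolerance,
    so that no single measured mass can fall within tolerance of both."""
    return abs(MONO_SHIFT[mod_a] - MONO_SHIFT[mod_b]) > 2 * tol_da


if __name__ == "__main__":
    # Hypothetical numbers: 6 of 8 site-determining ions matched, p = 0.05.
    print(binomial_tail(8, 6, 0.05))
    print(separable_by_mass("acetylation", "trimethylation", 0.5))   # low resolution
    print(separable_by_mass("acetylation", "trimethylation", 0.01))  # high resolution
    print(separable_by_mass("deamidation", "citrullination", 0.001))
```

A small tail probability means the observed matches are unlikely to have arisen by chance at that site. Deamidation and citrullination stay inseparable at any tolerance, so in such cases only the site-determining ions and site specificity can distinguish them.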
The OMSSA search engine was chosen for testing our ISPTM approach in the current study because it is open-source and platform-independent.9 However, in principle, ISPTM can be applied to other search engines, such as SEQUEST and Mascot. The computational resource required is an important concern for large-scale PTM identification in complex proteome data. In this study, the cumulative CPU time for analyzing the NM datasets was 4,535 hours with ISPTM and 834 hours with MODa (Supplemental Table 7). Depending on the number of cores/CPUs available for parallel computing, an ISPTM search of complex proteome datasets could be completed in a few hours. In this study, we used a 1,151-node Linux cluster for our analyses because we were analyzing up to 207 modifications. Such a large computing resource currently may not be available to all investigators. However, most studies would be expected to involve a smaller set of PTMs (fewer than 20). We have shown that the ISPTM approach has superior performance in testing 13 modifications compared to the all-in-one search. For those studies, a desktop PC would be suitable for iteratively testing one or two modifications at a time. Moreover, the issue of computing power should be addressed by the rapid development of modern computational technology such as supercomputers and cloud computing.38 Indeed, any academic researcher in the United States already has access to a large computational resource via OSG (Open Science Grid, https://www.opensciencegrid.org) or, more broadly, XSEDE (Extreme Science and Engineering Discovery Environment, https://www.xsede.org/).
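The relation between cumulative CPU hours and elapsed time can be made concrete with a back-of-the-envelope estimate. The sketch below assumes perfect parallelism with no scheduling overhead and evenly sized jobs, so it gives a lower bound rather than a prediction. The CPU-hour figures come from the text, while the core counts and the function name are hypothetical.

```python
def ideal_wall_hours(cpu_hours: float, n_cores: int) -> float:
    """Lower bound on elapsed hours if the work divides evenly across cores."""
    if n_cores < 1:
        raise ValueError("n_cores must be at least 1")
    return cpu_hours / n_cores


if __name__ == "__main__":
    ISPTM_CPU_HOURS = 4535  # NM datasets, from the text
    MODA_CPU_HOURS = 834    # NM datasets, from the text
    for cores in (8, 100, 1000):  # hypothetical core counts
        print(
            cores,
            round(ideal_wall_hours(ISPTM_CPU_HOURS, cores), 2),
            round(ideal_wall_hours(MODA_CPU_HOURS, cores), 2),
        )
```

On the order of a thousand cores, the ISPTM figure falls to a few hours under these idealized assumptions, in line with the statement above.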
In summary, we have developed a novel approach, termed ISPTM, for the systematic identification of PTMs in proteome samples. The ISPTM approach enables conventional database search methods to be used for systematic PTM identification. The results obtained with ISPTM demonstrated that chemical modifications, such as carboxyamidomethylation of histidine, glutamic acid, aspartic acid, and lysine and carbamylation of lysine, are abundant when IAA and urea are used in sample preparation. With the increasing size of the PTM knowledge database, the ISPTM approach will advance PTM analysis from limited identification and quantitation to global PTM discovery in complex biological samples.
We thank Dr. Lawrence Schopfer for editing this manuscript. The mass spectrometry data were collected in the Mass Spectrometry and Proteomics Core Facility at the University of Nebraska Medical Center (UNMC), which is supported by the Nebraska Research Initiative. We thank Jim Keagy from Thermo Scientific for arranging demo experiments on the Q-Exactive mass spectrometer. We thank the Proteomics Standards Research Group (sPRG) from the Association of Biomolecular Resource Facilities for providing the synthetic modified peptide mixtures. This work was completed utilizing the Holland Computing Center of the University of Nebraska and the Open Science Grid. This work was financially supported by the Department of Pathology and Microbiology at UNMC and NEHHS LB606 (S.J.D.) and by National Institutes of Health (NIH) grant AI076475 (Z.Z.). X.H. and H.P. were supported in part by scholarships from the Chinese Scholarship Council, and M.L. was supported by a scholarship from the College of Medicine at UNMC.