The identification of peptides by microcapillary liquid chromatography-tandem mass spectrometry (µLC-MS/MS) has become routine because of the development of fast scanning mass spectrometers, data-dependent acquisition, and database searching algorithms. However, many peptides within the detection limit of the mass spectrometer remain unidentified because of limitations in MS/MS sampling speed, despite the dynamic range and peak capacity of the instrument. We have developed an automated approach that uses the mass spectra from high resolution µLC-MS data to define the molecular species present in the mixture and directs the acquisition of MS/MS spectra to precursors that were missed in prior analyses. This approach increases the coverage of the molecular species sampled by MS/MS and consequently the number of peptides and proteins identified during the acquisition of technical or biological replicates using a simple one-dimensional chromatographic separation. The combination of a unique workflow and custom software contributes to the improved identification of molecular features detected in proteomics experiments of complex protein mixtures.
In shotgun proteomics, protein mixtures are routinely digested to peptides and analyzed by microcapillary liquid chromatography-tandem mass spectrometry (µLC-MS/MS). The MS/MS spectra are usually sampled by automated data-dependent acquisition (DDA), where information acquired in a previous scan is used to make decisions in how the data is acquired in subsequent scans1;2. When using data-dependent acquisition, the number of peptides sampled is limited by the MS/MS sampling speed despite the dynamic range and peak capacity of the mass analyzer. A single spectrum can contain over a hundred different molecular species, of which only a handful are analyzed by MS/MS prior to the next full scan. Although the most abundant peptides persist over many spectra, the peptides sampled between technical replicates can vary by as much as 30%3. Thus, in a complex protein digest, many peptides can remain unsampled in technical replicates, and most of the instrument time is spent re-sampling the most abundant features in the sample.
Recently, we reported the peptide feature detection algorithm, Hardklör, that can be used for the detection of persistent peptide isotope distributions (PPIDs) from high resolution μLC-MS data4. Here we describe the use of Hardklör combined with an LTQ-Orbitrap mass spectrometer for improving the fraction of peptides sampled in a mixture from either technical or biological replicates. Our approach uses the high peak capacity of the mass analyzer to resolve and detect peptide features that would not normally be sampled in the presence of more intense interfering signals. By cataloging detectable features, we can efficiently increase our sample coverage by prioritizing missed PPIDs in replicate runs through the use of m/z inclusion lists. We call our approach post analysis data acquisition (PAnDA), and it uses the information in previous runs to direct the acquisition of MS/MS spectra in subsequent analyses. Our approach has been fully automated: instrument control, post-acquisition data analysis, method modification, and reanalysis are all handled by software developed in house. We demonstrate that PAnDA significantly improves the identification of peptides/proteins in technical replicates of a C. elegans protein digest and improves the fraction of detected features that are sampled by MS/MS.
Caenorhabditis elegans and Escherichia coli strains were obtained from the CGC at the University of Minnesota. All other reagents were purchased from Sigma-Aldrich (St. Louis, MO) unless specified otherwise.
C. elegans (N2 strain) were grown on enriched peptone plates seeded with the OP50 strain of E. coli at 20°C. Worms of all developmental stages were washed from the plates with M9 buffer (22 mM KH2PO4, 22 mM Na2HPO4, 85 mM NaCl, 1 mM MgSO4; VWR, West Chester, PA) and sucrose floated to remove bacterial contamination. The worms were then lysed in 50 mM ammonium bicarbonate pH 7.8 using the small probe of a sonic dismembrator model 100 (Thermo Fisher Scientific, Pittsburgh, PA) for 5 cycles of a 20 second continuous pulse followed by a 60 second ice incubation. The lysate was then centrifuged at 4000 rpm for 10 minutes at 4°C in an Eppendorf 5417R microcentrifuge (Westbury, NY) to remove cell debris. A second centrifugation at 14000 rpm for 10 minutes at 4°C was then performed to separate the soluble lysate from the insoluble lysate.
The lysate was denatured using 0.1% RapiGest SF (Waters Corporation, Milford, MA) in 50 mM ammonium bicarbonate pH 7.8. To increase denaturation, the lysate was vortexed and boiled at 100°C for 5 minutes. After cooling, the lysate was reduced with 5 mM DTT, alkylated with 15 mM IAA and digested to peptides using trypsin (Promega, Madison, WI) at a substrate to enzyme ratio of 100:1 for one hour at 37°C with shaking. The lysate was then treated with 200 mM HCl to remove RapiGest from the sample.
The C. elegans digest (4 µg) was loaded from the autosampler onto a fused-silica capillary column (75-µm i.d.) packed with 40 cm of Jupiter C12 material (Phenomenex) mounted in an in-house constructed microspray source and placed in line with a Waters NanoAcquity HPLC and autosampler. The column length and HPLC were chosen specifically to provide highly reproducible chromatography between technical replicates. Peptide elution was performed using two buffer solutions: Buffer A was a mixture of 94.9% water, 5% acetonitrile, and 0.1% formic acid, and Buffer B was a mixture of 99.9% acetonitrile and 0.1% formic acid. The gradient program consisted of four steps totaling 100 minutes: 1) a 60-minute gradient of 5 to 30% Buffer B, 2) a 10-minute gradient of 25 to 75% Buffer B, 3) a 10-minute hold at 75% Buffer B, and 4) column re-equilibration with 95% Buffer A for 20 minutes. Tandem mass spectra were acquired using either traditional data-dependent acquisition with dynamic exclusion turned on or PAnDA (see below). In both cases, a single high resolution mass spectrum was acquired at 60,000 resolution (at m/z 400) in the Orbitrap mass analyzer in parallel with 5 low resolution MS/MS spectra acquired in the LTQ.
PAnDA was performed in a manner similar to the standard data-dependent acquisition analysis; however, the five data-dependent MS/MS scans were performed on ions selected from an m/z inclusion list generated by software written in house. Inclusion lists were generated from prior analyses of the same sample on the same column. If an ion observed in the profile scan matched an ion from the inclusion list within a specified retention time window, it was isolated for fragmentation. If no ions were matched between the inclusion list and the profile scan, the next most abundant ion from the profile scan was selected for fragmentation.
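The selection rule described above can be sketched in a few lines of Python. This is only an illustration of the logic, not the instrument firmware: the names `InclusionEntry` and `select_precursors` are hypothetical, the 10 ppm matching tolerance is an assumption (the tolerance used by the instrument is not stated here), and dynamic exclusion is not modeled.

```python
from dataclasses import dataclass

@dataclass
class InclusionEntry:
    mz: float        # m/z of the listed ion
    rt_start: float  # start of retention time window (s)
    rt_end: float    # end of retention time window (s)

def within_ppm(observed, target, tol_ppm):
    """True if observed m/z lies within tol_ppm of the target m/z."""
    return abs(observed - target) / target * 1e6 <= tol_ppm

def select_precursors(peaks, scan_rt, inclusion, n_msms=5, tol_ppm=10.0):
    """Pick up to n_msms precursor m/z values from one profile scan.

    peaks: list of (mz, intensity) tuples from the scan.
    Ions matching an inclusion entry whose window contains scan_rt are
    taken first; remaining slots go to the most abundant other ions.
    """
    by_intensity = sorted(peaks, key=lambda p: p[1], reverse=True)
    active = [e for e in inclusion if e.rt_start <= scan_rt <= e.rt_end]
    listed = [p for p in by_intensity
              if any(within_ppm(p[0], e.mz, tol_ppm) for e in active)]
    others = [p for p in by_intensity if p not in listed]
    return [mz for mz, _ in (listed + others)[:n_msms]]
```

With this sketch, a weak ion on the list is chosen ahead of stronger unlisted ions while its window is open, and the scan falls back to intensity order once the window has closed.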
To generate the inclusion lists for PAnDA, the high resolution Orbitrap mass spectra were analyzed by the feature detection software Hardklör to identify the persistent peptide isotope distributions (PPIDs) detected in the µLC-MS analysis. Hardklör identified features in any one spectrum using the following parameters: charge states between +1 and +5, signal-to-noise threshold of 1.0, deconvolution of up to three isotope distributions per 5 m/z spectrum window, and a minimum correlation threshold of 0.90. PPIDs were defined as features matched between spectra within a 10 ppm monoisotopic mass tolerance and with identical charge state that persisted in at least three of four consecutive scans. PPIDs from each analysis were stored in a database to be tracked across technical replicates. PPIDs were matched across replicates if their monoisotopic masses differed by less than 10 ppm, their charge state was identical, and they had overlapping retention times. Each PPID match between replicates was combined into a single entry in the database with a retention time window spanning the full range over which the PPID was found across all analyses. The database PPIDs were matched to the ions that were fragmented by MS/MS using two parameters: a 10 ppm mass tolerance between the MS/MS m/z value and the base isotope peak of the PPID, and overlapping retention time with a ± 15 second tolerance around the PPID retention time window. Any PPID that could not be matched to an existing MS/MS spectrum was placed in an inclusion list for isolation and fragmentation in replicate analyses.
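The matching rules in this paragraph can be illustrated with a minimal Python sketch. It is not the Hardklör or PAnDA source code: the `PPID` record and all function names are hypothetical, and the in-memory list stands in for the database. Only the thresholds (10 ppm, identical charge, overlapping retention time, three of four consecutive scans, 15 s tolerance) are taken from the text.

```python
from dataclasses import dataclass

@dataclass
class PPID:
    mono_mass: float       # monoisotopic mass (Da)
    charge: int
    base_mz: float         # m/z of the most intense isotope peak
    rt_start: float        # seconds
    rt_end: float          # seconds
    sampled: bool = False  # True once matched to an MS/MS spectrum

def ppm_diff(a, b):
    """Absolute difference between a and b in parts per million of b."""
    return abs(a - b) / b * 1e6

def persists(scan_hits, needed=3, window=4):
    """scan_hits: one boolean per consecutive scan (feature seen or not)."""
    return any(sum(scan_hits[i:i + window]) >= needed
               for i in range(len(scan_hits)))

def same_feature(a, b, tol_ppm=10.0):
    """Cross-replicate match: mass, charge, and overlapping retention time."""
    return (a.charge == b.charge
            and ppm_diff(a.mono_mass, b.mono_mass) < tol_ppm
            and a.rt_start <= b.rt_end and b.rt_start <= a.rt_end)

def merge_replicate(database, new_ppids):
    """Fold one replicate's PPIDs into the database, widening RT windows."""
    for new in new_ppids:
        match = next((p for p in database if same_feature(p, new)), None)
        if match is None:
            database.append(new)
        else:
            match.rt_start = min(match.rt_start, new.rt_start)
            match.rt_end = max(match.rt_end, new.rt_end)

def mark_sampled(database, msms_events, tol_ppm=10.0, rt_tol=15.0):
    """msms_events: list of (precursor_mz, rt_seconds) for acquired MS/MS."""
    for p in database:
        if any(ppm_diff(mz, p.base_mz) <= tol_ppm
               and p.rt_start - rt_tol <= rt <= p.rt_end + rt_tol
               for mz, rt in msms_events):
            p.sampled = True

def unsampled(database):
    """PPIDs still lacking an MS/MS spectrum: inclusion list candidates."""
    return [p for p in database if not p.sampled]
```

A linear scan over the database is used here for clarity; a real implementation handling tens of thousands of PPIDs would likely index entries by mass.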
Inclusion lists consisted of an m/z value and a narrow retention time window. The base isotope peak of PPIDs identified by Hardklör was used for the m/z value. The retention time window was defined as the time interval over which Hardklör observed the PPIDs, with a ± 15 second tolerance. After each replicate PAnDA analysis, the precursor masses of the fragmentation spectra were compared with the database of PPIDs. The retention time windows of existing PPIDs in the database were adjusted to account for observed shifts in the replicate chromatography. PPIDs that had been sampled and now had an MS/MS spectrum were removed from the inclusion list. Additionally, the high-resolution scans of each subsequent PAnDA analysis were analyzed by Hardklör to identify additional PPIDs that were not detected in prior runs; these were added to the database of PPIDs.
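As a rough illustration of how such a list could be assembled, the sketch below turns unsampled PPIDs into (m/z, start, end) entries with the 15 second padding described above. The dictionary keys and the function name `build_inclusion_list` are hypothetical, and clamping the window start at zero and sorting by start time are assumptions made for the example, not details from the method.

```python
def build_inclusion_list(ppids, rt_tol=15.0):
    """Return sorted (base_mz, rt_start, rt_end) entries for unsampled PPIDs.

    ppids: iterable of dicts with keys 'base_mz', 'rt_start', 'rt_end'
    (times in seconds) and 'sampled' (True once an MS/MS spectrum exists).
    """
    entries = []
    for p in ppids:
        if p["sampled"]:
            continue  # already has an MS/MS spectrum, so drop from the list
        start = max(0.0, p["rt_start"] - rt_tol)  # assumed clamp at run start
        end = p["rt_end"] + rt_tol
        entries.append((p["base_mz"], start, end))
    # Order by window start, then m/z, so the list follows elution order
    return sorted(entries, key=lambda e: (e[1], e[0]))
```

Rebuilding the list from the database after every run, rather than editing it in place, keeps the removal of sampled PPIDs and the addition of newly detected ones in a single step.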
The PAnDA analysis was automated using custom software that used Xcalibur libraries to control the instrument acquisition and modify the method files. This program was implemented in C++ and used the Microsoft Foundation Class (MFC) library to create a Windows graphical user interface. Our prototype software can 1) communicate with the mass spectrometer and initiate a sample analysis using an Xcalibur method, 2) extract the MS and the MS/MS data from the RAW file, 3) process the data with Hardklör as described above, and 4) modify the previous instrument method based on the Hardklör results. This process could be repeated a user-defined number of times. Software to perform this analysis is available for noncommercial use at http://proteome.gs.washington.edu/software/panda/.
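The four-step cycle can be summarized as a simple driver loop. The Python sketch below is a schematic only: the actual program is written in C++ against Xcalibur libraries, and none of those calls are reproduced here. Each step is passed in as a caller-supplied function, and every name (`run_panda`, `acquire`, `detect_features`, `update_database`, `write_method`) is a hypothetical placeholder.

```python
def run_panda(n_iterations, acquire, detect_features,
              update_database, write_method, initial_method):
    """Repeat acquire -> detect -> update -> rewrite method n_iterations times.

    acquire(method) -> raw data handle            (steps 1 and 2)
    detect_features(raw) -> list of features      (step 3)
    update_database(db, features, raw) -> new db  (bookkeeping of PPIDs)
    write_method(method, db) -> next method       (step 4)
    """
    method = initial_method
    database = []
    methods = []
    for _ in range(n_iterations):
        raw = acquire(method)
        features = detect_features(raw)
        database = update_database(database, features, raw)
        method = write_method(method, database)
        methods.append(method)
    return database, methods
```

Passing the steps in as functions keeps the loop independent of any vendor interface, so the same control flow can be exercised with stand-in functions before it is connected to an instrument.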
Tandem mass spectra were searched against the Wormpep protein annotations (downloaded from http://wormbase.org) and a shuffled decoy using the database searching algorithm SEQUEST5. Peptide spectrum matches from SEQUEST were combined for all replicates and post-processed using Percolator6, which assigned a q-value to each spectrum using the data from all replicates together, as described previously7;8. The peptide spectrum matches were filtered using a threshold of q-value ≤ 0.01 and assembled into protein identifications using DTASelect9. Total peptide and protein identifications across the multiple DDA and PAnDA analyses were compared using Contrast9.
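For readers unfamiliar with q-values, the following Python sketch shows the basic decoy-counting idea behind a q-value threshold. It is a deliberate simplification and an assumption of this example: Percolator rescores matches with a machine learning model before estimating q-values, which is not reproduced here, and the function name `decoy_q_values` is hypothetical.

```python
def decoy_q_values(psms):
    """Estimate q-values from target and decoy matches.

    psms: list of (score, is_decoy) tuples; a higher score is better.
    Returns (score, is_decoy, q_value) tuples sorted by descending score.
    """
    ordered = sorted(psms, key=lambda p: p[0], reverse=True)
    fdrs = []
    targets = decoys = 0
    for _, is_decoy in ordered:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Estimated false discovery rate if the threshold were set here
        fdrs.append(decoys / max(targets, 1))
    # q-value: lowest FDR at this threshold or at any more permissive one
    q_values = []
    running_min = float("inf")
    for fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        q_values.append(running_min)
    q_values.reverse()
    return [(s, d, q) for (s, d), q in zip(ordered, q_values)]

def accepted_targets(psms, threshold=0.01):
    """Target matches passing the q-value threshold."""
    return [(s, q) for s, d, q in decoy_q_values(psms)
            if not d and q <= threshold]
```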
A total of six technical replicates were completed, using both the standard data-dependent analysis (DDA) and the post-analysis data acquisition (PAnDA), as described in the methods. The first µLC-MS/MS analysis was performed using standard data-dependent acquisition and provided the base set of persistent peptide isotope distributions (PPIDs). After this first µLC-MS/MS run, five additional replicates were performed consisting of two analyses: a replicate analysis using PAnDA and a control analysis using the standard DDA analysis. The two types of analyses were alternated to minimize artifactual differences resulting from systematic errors occurring over time (Figure 1).
We have shown previously that the number of features detected by Hardklör from a µLC-MS/MS analysis can far exceed the number of molecular species that can be sampled by data-dependent MS/MS4. Figure 2A illustrates the PPIDs detected by Hardklör from the first µLC-MS/MS of C. elegans peptides. The PPIDs were detected over a wide dynamic range throughout the entire analysis.
Figure 2B plots the PPIDs that were not sampled by MS/MS. A large portion of the detected PPIDs remained unsampled after the first analysis, and these features tend to be the low to moderately low abundance signals. Some PPIDs of moderate intensity remained unsampled by data-dependent acquisition for reasons that are not entirely clear.
PAnDA increased the fraction of detectable peptide features that are sampled for MS/MS (Figure 3A). The six replicates that did not use PAnDA performed MS/MS on only 8,345 of 25,004 (33.4%) detected PPIDs found in at least one of the samples. In contrast, PAnDA sampled 20,339 of 25,829 (78.7%) total detected PPIDs. If only the PPIDs that were found in at least 3 of the 6 analyses are considered real quantifiable features, then standard DDA sampled 5,115 of 9,339 (54.8%) and PAnDA sampled 10,721 of 10,991 (97.5%) detected PPIDs. The ability of PAnDA to sample nearly all of the features detected in at least half of the analyses demonstrates the comprehensiveness of our approach and the potential to annotate most of the quantifiable signals in replicate label-free differential proteomics experiments10.
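The percentages quoted above follow directly from the PPID counts. The short Python check below recomputes them; the counts are taken from the text, while the function name `percent_sampled` and the row labels are only illustrative.

```python
def percent_sampled(sampled, detected):
    """Share of detected PPIDs that received an MS/MS scan, in percent."""
    return 100.0 * sampled / detected

# (label, PPIDs sampled by MS/MS, PPIDs detected), counts from the text
rows = [
    ("DDA, PPIDs in at least 1 run", 8345, 25004),
    ("PAnDA, PPIDs in at least 1 run", 20339, 25829),
    ("DDA, PPIDs in at least 3 of 6 runs", 5115, 9339),
    ("PAnDA, PPIDs in at least 3 of 6 runs", 10721, 10991),
]

for label, sampled, detected in rows:
    print(f"{label}: {percent_sampled(sampled, detected):.1f}%")
```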
We also plotted the frequency of detected PPIDs sampled by MS/MS relative to the log signal intensity (Figure 3B). As expected, PAnDA lowered the mean intensity of peptide features sampled for MS/MS by nearly an order of magnitude when compared to using standard DDA. The increase in sampling of peptide ions of lower abundance resulted in greater use of the dynamic range of the Orbitrap mass analyzer by forcing the LTQ mass analyzer to sample lower intensity signals in the presence of coeluting species of greater intensity.
When comparing the replicate µLC-MS/MS runs acquired using PAnDA to prioritize the selection of peptide precursor ions in subsequent technical replicates, there was a significant increase in the number of peptide and protein identifications compared with standard DDA (Figure 4). The use of PAnDA resulted in a 30.9% increase (3,849 vs. 2,941) in peptide identifications returned by database searching over the standard repeated analysis by µLC-MS/MS without specifying inclusion lists of specific features to be sampled. This increase in peptide identifications translated to a 20.5% increase (1,059 vs. 879) in the proteins identified in the sample. The greatest performance increase was observed after just the second iteration (the first technical replicate), with an increase of 929 additional peptides when using an inclusion list versus 422 additional peptides without the list. While most of the proteins identified by DDA were also identified by PAnDA, a fraction of the total identified proteins (13.2%) were unique to DDA (Figure 5).
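The reported gains and the overlap between the two protein sets can be cross-checked with a few lines of arithmetic. In the Python sketch below the identification counts come from the text, and the total of 1,220 proteins, of which 161 were found only by DDA, is the figure cited for Figure 5 in the discussion. Treating 1,220 as the union of both methods is an assumption of this example, although it agrees with the 13.2% quoted above; the shared and PAnDA-only counts are derived here rather than reported.

```python
def percent_increase(new, old):
    """Relative increase of new over old, in percent."""
    return 100.0 * (new - old) / old

panda_peptides, dda_peptides = 3849, 2941
panda_proteins, dda_proteins = 1059, 879
total_proteins, dda_only = 1220, 161   # assumed union of both methods

peptide_gain = percent_increase(panda_peptides, dda_peptides)   # about 30.9
protein_gain = percent_increase(panda_proteins, dda_proteins)   # about 20.5

# Derived overlap, under the assumption that 1,220 is the union
shared = dda_proteins - dda_only        # proteins found by both methods
panda_only = panda_proteins - shared    # proteins found only by PAnDA
dda_only_percent = 100.0 * dda_only / total_proteins            # about 13.2
```

Under that assumption the three groups (shared, DDA-only, PAnDA-only) sum back to 1,220, which is a useful consistency check on the reported counts.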
Despite the flexibility and automation of data-dependent acquisition, instrument limitations prevent complete sampling of the detectable molecular species in a complex mixture. Improvements in MS/MS scan speed have increased the number of peptides that can be sampled by DDA11;12, but sampling is still far from comprehensive, which leaves significant uncertainty over what fraction of the detected features have been sampled by MS/MS. Furthermore, data-dependent acquisition prioritizes the sampling of the most abundant features, as opposed to lower abundance and possibly more interesting signals.
We describe the use of inclusion mass lists computed from the prior analyses to direct the sampling of MS/MS in replicate analyses. We have shown how PAnDA can direct data-dependent acquisition towards ions missed in previous analyses. However, because Hardklör can detect unusual isotope distributions and records the signal intensity for each persistent peptide isotope distribution4, PAnDA can also be used to direct MS/MS to features that are different between samples or to features that have been labeled to create an unusual isotope distribution4;13;14. The peak capacity and resolution of Fourier transform mass analyzers can resolve more individual components than can be sampled. PAnDA uses the high resolution data to prioritize peptide ions for MS/MS using an inclusion list. Thus, we can increase the sampling of detectable molecular species in a complex mixture through the technical and biological replicates that are required for any comparative analyses.
Although PAnDA increased the total number of proteins identified, a fraction of proteins (161 of 1220, Figure 5) were only identified in the repeated DDA analyses and not by PAnDA. This result is likely because of the variability in the MS/MS spectra for a given peptide: not every acquired MS/MS spectrum is of high quality. In fact, Venable et al. showed that when a single purified peptide was infused and continuously sampled by MS/MS during the infusion, only a fraction of the MS/MS spectra were correctly assigned by database searching15. However, when the results of multiple independent searches were combined, the identification of the correct peptide sequence improved15. Because DDA tends to resample the same peptides repeatedly, DDA increases the chance of acquiring a suitable spectrum to identify the peptide in a subsequent analysis. Given enough sampling, the chance of obtaining a spectrum that can identify a “difficult” peptide increases. PAnDA intentionally minimizes the resampling of peptide ions, so a fraction of the peptides that can be identified in replicate DDA analyses are unidentifiable from a single spectrum using PAnDA. Future improvements will include the capability to allow features to remain on the inclusion list until they have been sampled a user-defined number of times and/or a check of MS/MS spectrum quality16 before the feature is removed.
Another common approach to improve the fraction of peptides sampled by data-dependent acquisition is to fractionate the mixture on either the protein level prior to digestion17–19 or on the peptide level20–22 prior to µLC-MS/MS. This approach, although technically different than PAnDA, seeks to achieve the same goal: improved sampling of the molecular components of a complex mixture. PAnDA is not a substitute for sample fractionation and simplification. While PAnDA is only useful if detected molecular species go unsampled by MS/MS, sample fractionation can enrich for low abundance species of interest and minimize ion suppression from high abundance interfering peptides. Of course PAnDA analysis can be combined with fractionation methods to further increase sample coverage if there is time and sample to perform replicate analyses.
The benefits of PAnDA are more than just increased peptide and protein identifications. While fractionation methods also increase peptide identifications, they complicate the acquisition of technical and biological replicates as well as the quantitative comparison of the signal maps between conditions and samples without stable isotope labeled internal standards23. PAnDA increases the number of peptide identifications during the acquisition of technical and biological replicates and, thus, is ideally suited for a simpler analysis that relies on the peak capacity of the Fourier transform mass analyzer to separate peptides and not biochemical fractionation. The use of inclusion mass lists minimizes the overlap of sampled features between replicates while using a simpler, more consistent chromatography that is more robust for performing quantitative comparisons10;23–26. Thus, the features in the high resolution Fourier transform mass spectra are used to draw comparisons between samples and the MS/MS spectra acquired using PAnDA are used to improve the fraction of the features that can be annotated with a peptide sequence.
Our approach is similar to a method recently reported by Schmidt et al.27 However, a major difference between our method and that of Schmidt et al. is that ours is implemented using the Hardklör software infrastructure, an algorithm that we4 and others28 have demonstrated is not only highly sensitive and accurate but also computationally efficient and fast. This speed makes implementation of PAnDA with Hardklör feasible in real time with minimal computational overhead on a modest desktop computer. PAnDA is applied to the first DDA analysis to generate an inclusion mass list, and additional PAnDA replicates are used to dynamically adjust the list to maximize sample coverage. This eliminates the need to perform replicate DDA analyses prior to building an inclusion mass list of the most replicated signals. Furthermore, PAnDA takes advantage of the latest features in the Xcalibur instrument software, which make possible the inclusion of thousands of m/z values in a single mass list without the need for a complex, segmented run method. While only 500 m/z values can be handled by the on-board digital signal processor at a time (M. Senko, personal communication), the use of narrow retention time windows in the inclusion list keeps the number of peaks handled by the on-board computer and firmware to a minimum.
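Whether a given inclusion list stays under the 500-value limit depends on how many retention time windows are open at once, not on the total list length. The Python sketch below counts the peak number of simultaneously active windows with a sweep over window boundaries. The function names and the idea of checking a list this way are assumptions of this example, not a description of what the instrument firmware or the PAnDA software does.

```python
def max_concurrent(windows):
    """Largest number of retention time windows open at the same moment.

    windows: iterable of (rt_start, rt_end) pairs with rt_start <= rt_end.
    Windows that merely touch at an endpoint are counted as overlapping.
    """
    events = []
    for start, end in windows:
        events.append((start, 1))   # window opens
        events.append((end, -1))    # window closes
    # At equal times, process openings before closings
    events.sort(key=lambda e: (e[0], -e[1]))
    active = peak = 0
    for _, delta in events:
        active += delta
        peak = max(peak, active)
    return peak

def fits_limit(windows, limit=500):
    """True if the list never needs more than `limit` active entries."""
    return max_concurrent(windows) <= limit
```

For example, four windows of (0, 10), (5, 15), (10, 20), and (30, 40) seconds give a peak of three active entries, because the first three all cover the 10 second mark.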
Another difference between the two approaches is that our inclusion lists are always built with the most intense base isotope peak and not the monoisotopic peak. For peptide sequences above ~1,200 Da the monoisotopic peak is not the most intense isotope peak. Thus, for low intensity peptide signals where the entire isotope distribution might not be detectable, we decided to err on the side of placing the most intense isotope peak in the inclusion list, as opposed to the lowest m/z isotope peak, to increase the chance of sampling. Furthermore, because the isolation and fragmentation are performed in the low resolution LTQ mass analyzer, we hoped to isolate and fragment the entire isotope distribution to ensure the sensitivity of the low resolution product ion spectrum29–31.
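The difference between the two choices of list value is easy to see numerically. The Python sketch below computes the m/z of each isotope peak from a neutral monoisotopic mass and charge, then picks the most intense one. The function names and the example intensities are hypothetical, and the isotope spacing is approximated by the 13C to 12C mass difference; in practice the intensities would come from the observed or modeled isotope distribution.

```python
PROTON_MASS = 1.007276      # Da
ISOTOPE_SPACING = 1.00335   # Da, approximately the 13C - 12C mass difference

def isotope_mz(mono_mass, charge, k):
    """m/z of the k-th isotope peak (k = 0 is the monoisotopic peak).

    mono_mass is the neutral monoisotopic mass in Da.
    """
    return (mono_mass + k * ISOTOPE_SPACING) / charge + PROTON_MASS

def base_peak_mz(mono_mass, charge, intensities):
    """m/z of the most intense isotope peak.

    intensities: relative intensities of isotope peaks 0, 1, 2, ...
    """
    k = max(range(len(intensities)), key=intensities.__getitem__)
    return isotope_mz(mono_mass, charge, k)

# Hypothetical 2+ peptide of 2000 Da whose second peak is the most intense
mono = isotope_mz(2000.0, 2, 0)
base = base_peak_mz(2000.0, 2, [0.8, 1.0, 0.6])
```

In this example the base peak lies about 0.5 m/z above the monoisotopic peak, which is far outside a 10 ppm matching tolerance, so the two choices produce distinct inclusion list entries.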
A potential alternative to inclusion mass lists is exclusion mass lists32. Similar in concept, exclusion mass lists prevent fragmentation of previously selected ions. Our software can also easily output PAnDA exclusion lists as an alternative to, or in addition to, inclusion lists. However, to prevent the isolation and sampling of neighboring isotope peaks by the data-dependent acquisition, wide mass windows need to be used that exclude the entire isotope distribution. This use of wide exclusion windows negates much of the benefit of ultra-high resolution mass analyzers because many of the detected persistent isotope distributions overlap in complex mixtures4;33. An alternative to excluding a wide m/z window would be to exclude all isotope peaks individually with narrow windows for a detected feature that was fragmented in a prior analysis. The downside of excluding all individual isotope peaks is that this would increase the size of the exclusion list five- to seven-fold, which would easily become too large to be handled by the instrument digital signal processor. With these caveats in mind, careful use of combined inclusion and exclusion mass lists may provide the most efficient method of sample coverage.
Commercially available hybrid mass spectrometers that combine the high resolution and dynamic range of a Fourier transform mass analyzer with a highly sensitive, fast scanning, low resolution ion trap are powerful tools for shotgun proteomics. Unfortunately, the scan speed of data-dependent acquisition is incapable of sampling all of the detectable peptide-like signals in a µLC-MS analysis using an Orbitrap or FT-ICR mass analyzer. Assuming that all comparative proteomics experiments will involve the use of replicated analyses, we have developed methodology and software that can increase the fraction of detectable signals in the high resolution mass analyzer that can be assigned a peptide identity in a relatively short one-dimensional chromatographic analysis. We have demonstrated that PAnDA improves the identification of peptides and proteins relative to standard DDA and anticipate that this general analysis scheme will become more widely adopted in the future. To aid in this adoption, we have automated the entire process using custom software developed in house.
We would like to thank Drs. Michael Senko and Amol Prakash of Thermo Fisher Scientific for helpful discussions on how the inclusion lists are used by the instrument firmware and suggestions on how to integrate PAnDA with the Xcalibur data-system. We also appreciate the helpful comments of Daniela Tomazela and Jesse Canterbury of the MacCoss lab. Support for this work was provided in part by National Institutes of Health grants P41 RR011823 and R01 DK069386 and by the University of Washington's Proteomics Resource.