In this study, we systematically evaluated the relative merits of repetitive LC-MS/MS runs compared with the introduction of an additional protein-level separation step for increasing proteome coverage of cancer cell lysates. Factors that affect apparent reproducibility between proteome analyses performed on the same sample were also evaluated. The basic analysis platform used for repetitive analyses was the commonly utilized GeLC-MS/MS method, which can be considered a 2-D proteomics method because it involves two dimensions of separation: protein separation using SDS-PAGE and reverse-phase HPLC separation of tryptic peptides. This method was compared to a 3-D method consisting of solution IEF at the protein level followed by the GeLC-MS/MS method.
It is important when comparing alternative analysis platforms to consider the total number of LC-MS/MS runs per proteome, because improved proteome coverage typically can be achieved by lengthening the HPLC gradient or by repeating LC-MS/MS analysis of complex samples.(9,10)
Similarly, many separation modes prior to LC-MS/MS can be at least incrementally improved simply by increasing the number of fractions collected, provided that the resolution of the separation method exceeds the initial fraction size used. But in some cases, increases in protein coverage may be too small to be advantageous once total analysis time per proteome is considered. Hence, the merits of greater depth of analysis, particularly small improvements, must constantly be weighed against overall throughput. Furthermore, total mass spectrometer instrument time frequently is the limiting resource, and when fractionation prior to LC-MS/MS is used, the LC-MS/MS step itself is rate-limiting for proteome analysis throughput. Hence, the most meaningful comparisons are those in which the total mass spectrometer analysis time per proteome is held constant. In this study, we used a consistent gradient time and 80 LC-MS/MS runs for both the 3-D method and the 2-D/repetitive-run method (four repeat injections). Similarly, all other experimental variables were held as constant as possible, including use of replicate aliquots of a single cell lysate preparation, gel separation lengths, gel volumes per trypsin digestion reaction, instrument tuning, and data-analysis methods. Our goal was to determine, quantitatively, which method represents the more efficient utilization of mass spectrometer time when analyzing complex proteomes.
Robust proteome analysis methods should be reproducible in addition to identifying the majority of proteins present in a biological sample. One major cause of variation in the proteins identified in replicate analyses of the same proteome is undersampling in the mass spectrometer, as discussed above. Therefore, high proteome coverage should be linked to good reproducibility of proteome analysis results, because extensive proteome coverage will occur only if undersampling is minimized. A second factor that contributes to poor reproducibility between proteome protein lists is the use of data-filtering conditions that result in high false peptide- and protein-identification rates, since false positives usually are random. Hence, data-filtering stringency is another tradeoff that must be considered when selecting a proteome analysis strategy. While low-stringency filters contribute to noise and low reproducibility, excessively stringent filters will greatly diminish the number of protein identifications and hence the value of the experiment. In the current study, data filters were used that yielded peptide false-positive rates between 1 and 2% as estimated using a decoy reverse database, thereby minimizing apparent irreproducibility between data sets. This level of stringency results in very few false positives among proteins identified by two or more peptides and, while there are some false positives within the one-hit protein list, a majority of these identifications are correct.
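The decoy-based estimate mentioned above can be illustrated with a minimal sketch. This is not the authors' actual pipeline; the counts below are hypothetical and chosen only to land within the 1-2% range reported in the study. The standard assumption is that each peptide-spectrum match (PSM) to the reversed decoy database implies roughly one false match within the target (forward) database.

```python
# Illustrative sketch of decoy-based false-positive rate estimation.
# Assumption: one decoy PSM above the score filter implies ~one false
# target PSM above the same filter.

def estimated_fdr(target_hits: int, decoy_hits: int) -> float:
    """Estimate the peptide false-positive rate as decoy hits / target hits."""
    return decoy_hits / target_hits

# Hypothetical counts: 20,000 filtered target PSMs and 300 decoy PSMs
# give a ~1.5% peptide false-positive rate.
print(f"{estimated_fdr(20000, 300):.1%}")  # → 1.5%
```

Tightening the score filter lowers both counts, but the decoy count falls faster, which is the tradeoff between stringency and coverage discussed above.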
The repetitive analyses of 2-D data showed increased peptide and protein counts (Figure ), indicative of undersampling in the basic GeLC-MS/MS method used here. The overall gain from four repetitive analyses, for proteins identified by two or more peptides, was 1061 (59%) compared with the initial single analysis. A similar increase (61%) was observed in repetitive MudPIT using nine analyses.(10) As expected, the greatest positive impact on proteome coverage came from the second replicate run, which increased the number of proteins identified by two or more peptides by 33% while doubling instrument time. In contrast, adding a third and a fourth replicate increased protein coverage by only 13% and 6%, respectively. These data indicate that performing a second analysis of each fraction when using GeLC-MS/MS is a favorable tradeoff between instrument time and protein coverage. However, further doubling instrument time by performing four repetitive runs is unlikely to represent optimal use of instrument time for most types of experiments.
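The per-replicate gains are consistent with the overall figure if each quoted percentage is read relative to the cumulative total of the preceding runs, so that they compound. A quick arithmetic check under that assumed interpretation:

```python
# Back-of-envelope check: the quoted gains for runs 2-4 (33%, 13%, 6%),
# each taken relative to the cumulative total so far, compound to the
# overall 59% gain over a single analysis.
gains = [0.33, 0.13, 0.06]  # replicate runs 2, 3, and 4

total = 1.0
for g in gains:
    total *= 1.0 + g

print(f"cumulative increase: {total - 1.0:.0%}")  # → 59%
```

The same figures imply an initial single-run count of roughly 1061 / 0.59 ≈ 1800 proteins with two or more peptides.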
Of course, an alternative to performing repetitive analyses of gel fractions would be to collect more slices per gel lane. In analogous experiments where gel lanes were divided into 40 or 60 fractions, we observed increases in the number of proteins identified that were similar to those obtained in this study for duplicate and triplicate analyses of the 20 fractions per gel lane (data not shown). Although producing more fractions per gel lane increases the number of in-gel digestions, the overall increase in total analysis time per proteome is minor. Hence, we generally prefer to use more fractions per lane rather than replicate analyses when greater depth of analysis is desired in GeLC-MS/MS experiments. Longer gels and larger numbers of fractions per lane were not used in the current study because we wanted to keep total mass spectrometer time per proteome (approximately 160 h) within practical limits while simultaneously matching gel lengths, gel volumes, and other parameters. That is, extrapolating from other 2-D and 3-D experiments that we have performed, we expect that the total number of proteins for each data set (2-D, 2-D/repetitive runs, and 3-D) would have increased moderately if we had used 40 or 60 slices per gel lane for all samples. But 40 or 60 fractions per gel lane would have increased total instrument time to about 320 and 480 h per proteome, respectively, which represents an impractically low throughput for most studies. Interestingly, as the number of fractions per gel lane is increased to 40 or 60, the incremental gains in newly identified proteins diminish, analogous to the diminishing benefit of each additional replicate in the repetitive-run approach (Figure A). Although similar trends are observed for these two approaches, the mechanisms for increasing protein coverage are quite different.
That is, using a larger number of gel fractions increases protein separation and simplifies the mixture of proteins present in each fraction, while repetitive runs exploit subtle variations in peptide separations in replicate HPLC runs and subtle variations in data-dependent selection of low-level ions for MS/MS fragmentation and analysis.
The 3-D method clearly provided superior protein and peptide coverage compared with the 2-D/repetitive method, which indicates that adding an additional protein separation step represents a more efficient use of mass spectrometer instrument time. This method identified 3486 proteins with two or more peptides, 22% more than the 2-D/repetitive method using equal instrument time. At the peptide level, the 3-D method identified 30,385 high-confidence, nonredundant peptides, nearly 2.5 times more than were found in a single survey using the 2-D method (12,160) and 28% more than the cumulative count from four repetitive analyses (23,648). Furthermore, more unique peptides were found for most low-abundance proteins in the 3-D method data than in the cumulative 2-D method data (Figure ).
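The quoted ratios follow directly from the peptide counts; a short check:

```python
# Arithmetic check of the peptide-level comparisons quoted above.
peptides_3d = 30385         # 3-D method, high-confidence nonredundant peptides
peptides_2d_single = 12160  # single 2-D survey
peptides_2d_cum = 23648     # cumulative over four repetitive 2-D runs

print(f"{peptides_3d / peptides_2d_single:.1f}x a single 2-D survey")   # → 2.5x
print(f"{peptides_3d / peptides_2d_cum - 1:.0%} more than four runs")   # → 28% more
```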
It is not surprising that adding solution IEF as an additional orthogonal protein separation step to a GeLC-MS/MS method is an efficient strategy for increasing proteome coverage and sequence coverage of lower-abundance proteins. MicroSol IEF separates the proteins that would normally be in a single gel slice into four gel slices (see Figure ). Consequently, full-scan spectra were simplified, thereby minimizing undersampling. In addition, the simpler samples should decrease ion suppression effects and reduce the dynamic range within each digest. Finally, in some cases, improved scores for MS2 spectra in the 3-D method probably resulted from a lower probability of interfering ions being isolated with the target ion for fragmentation; thus, more peptides passed the data-filtering criteria as true-positive identifications. While the repetitive analysis strategy also improved proteome coverage, it had no built-in mechanism to reduce repeated sampling of abundant ions between replicates, nor could it explore the ions below the MS2 triggering threshold. An alternative technique that has sometimes been used to improve replicate runs is to scan different mass ranges in each replicate. However, pilot experiments suggested that this approach was less productive than the simple repetitive analysis method used here.
One frequent criticism of proteomics methods is that the proteins identified in repeat analyses often are not very reproducible. A recent study suggested that good reproducibility was achievable across 27 laboratories on a simple 20-protein mixture once uniform data processing was used.(35) But this simple sample of abundant proteins at equal concentrations does not reflect real biological complexity. Hence, in the current study, we compared reproducibility between different analysis methods using a very complex sample of biological interest, a human cancer cell lysate. Among the four replicate analyses of the 2-D samples, 1500–1600 proteins were shared among them (Supplemental Figure 1). This indicates that at least 76% of the proteins observed in one analysis were reproducibly detected despite significant undersampling. More importantly, at least 90% of the proteins observed in the 2-D/four-replicate data set on the basis of two or more peptides directly matched a corresponding protein in the 3-D data set, and most of the apparent mismatches were caused by trivial data-analysis issues.
A more rigorous comparison of the two comprehensive data sets showed that greater than 96% of the proteins identified in the 2-D/repetitive-run proteome were actually observed within the complete 3-D data set. One reason for the initially apparent lower reproducibility when protein names were compared was slight variation in peptide scores combined with the use of rigid data-filter cutoff values (see above and Figure ). That is, the 10% of proteins that were apparently unique to the 2-D/repetitive-run data set included 184 proteins (6.4% of the 2-D/repetitive protein list) that were identified by a single peptide in the 3-D data set. Reasons why these proteins were identified by only a single peptide in the 3-D data set include run-to-run variations in automated selection of low-abundance signals for MS/MS and run-to-run variations in SEQUEST scores coupled with the use of rigid data filters; the latter appears to occur frequently, as described above. A second factor contributing to the initially apparent lower reproducibility at the protein-list level is database redundancy together with the limitations of current software for consistently producing consensus protein lists from identified peptides. Of the 111 proteins apparently unique to the 2-D/repetitive-run data set and not matched even by a single-hit protein in the 3-D data set, most were highly homologous to proteins identified in the 3-D data set (see Results and Supplemental Table 1). Among the 385 peptides belonging to these 111 unique proteins in the 2-D data, only 159 peptides were not found in the 3-D filtered data. Although the 111 unique proteins comprised 4% of all proteins identified, the 159 unique peptides were only 0.7% of all 2-D filtered peptides. This illustrates that very small variations in identified peptides can have a proportionally larger “apparent” impact on variations in identified proteins. Because we used each unique peptide a single time during assembly of consensus protein lists, common sequences were assigned to the protein with the most unique sequences. Consequently, for a group of proteins with high sequence identity, one or two unique peptides could determine which member of the protein family emerged in the final consensus protein list. This illustrates that better software tools are needed for identifying and displaying putative unique proteins within protein families. Similarly, improved databases with uniform names or other labels that clearly indicate membership within a protein family would be beneficial.
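The consensus-assembly behavior described above, where each peptide is counted once and shared sequences are credited to the protein with the most unique peptides, can be sketched as follows. This is a minimal illustration with hypothetical protein and peptide names, not the actual software used in the study:

```python
# Sketch of the consensus protein-list rule described in the text: each
# peptide is used once, and peptides shared between homologous proteins are
# assigned to the protein with the most unique peptides. One or two unique
# peptides can therefore decide which family member appears in the final list.

def consensus(protein_peptides: dict[str, set[str]]) -> dict[str, set[str]]:
    proteins = list(protein_peptides)

    def unique_count(p: str) -> int:
        # Peptides seen in no other protein of the group.
        others: set[str] = set()
        for q in proteins:
            if q != p:
                others |= protein_peptides[q]
        return sum(1 for pep in protein_peptides[p] if pep not in others)

    # Rank proteins by unique-peptide count, then hand each peptide to the
    # highest-ranked protein containing it.
    ranked = sorted(proteins, key=unique_count, reverse=True)
    assigned: dict[str, set[str]] = {p: set() for p in ranked}
    used: set[str] = set()
    for p in ranked:
        for pep in protein_peptides[p]:
            if pep not in used:
                assigned[p].add(pep)
                used.add(pep)
    return {p: peps for p, peps in assigned.items() if peps}

# Hypothetical two-member family sharing two peptides: isoform_A has two
# unique peptides, isoform_B only one, so both shared peptides go to A.
family = {
    "isoform_A": {"PEP1", "PEP2", "SHARED1", "SHARED2"},
    "isoform_B": {"PEP3", "SHARED1", "SHARED2"},
}
result = consensus(family)
print(result["isoform_A"])  # all four of isoform_A's peptides
print(result["isoform_B"])  # only its one unique peptide
```

If isoform_B gained a second unique peptide while isoform_A lost one, the shared peptides would flip to isoform_B, which is exactly why small peptide-level variations can change the reported protein.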
In conclusion, additional prefractionation with MicroSol IEF substantially increased proteome coverage and sequence coverage compared with a GeLC-MS/MS repetitive-run method that utilized an equal amount of mass spectrometer time. Furthermore, the reproducibility of protein lists between the two methods was quite high because undersampling during data acquisition had been minimized. Most of the apparent differences in protein identifications were due to limitations of current sequence databases and protein naming conventions, as well as software limitations in filtering database search results and building consensus protein lists.