Overview of CPTAC interlaboratory studies
The studies undertaken by the CPTAC network reflect the challenges of conducting interlaboratory investigations (see ). Researchers at NIST produced a reference mixture of 20 human proteins at varying concentrations (NCI-20). Study 1 saw the distribution of sample to each participating laboratory; the teams were asked to identify the proteins using their own protocols with any available instruments. Study 2 established SOP 1.0 for both LC and MS/MS configuration and employed only Thermo LTQ and Orbitrap instruments. Substantial changes took place between Study 2 and Study 5; a new Saccharomyces cerevisiae (yeast) reference proteome was introduced [28], the SOP version 2.0 was developed, and a bioinformatic infrastructure was established to collect raw data files and to identify peptides and proteins. (Study 3 tested these tools in a small-scale methodology test, whereas Studies 4 and 7 were part of a parallel CPTAC network effort directed at liquid chromatography-multiple reaction monitoring-mass spectrometry for targeted peptide quantitation [29]; none of these studies are considered further in this work.) Study 5 analyzed the yeast reference proteome under the new SOP. Study 6 built upon Study 5 by including spikes of the Sigma UPS 1. The inclusion of NCI-20 in all of these studies enabled the measurement of variability for this sample in a variety of SOP versions and experimental designs. Study 8 employed no SOP; individual laboratories used their own protocols in analyzing two sample loads of the yeast lysate. The same quantity of yeast was analyzed in Studies 5, 6 and 8, with an additional 5x (high load) sample analyzed in Study 8. Study 5 produced six replicates of the yeast, whereas the other studies generated triplicates. The numbers of identified spectra were consistent for individual instruments, but more variable among multiple instruments. These yeast data sets, along with the quintuplicate NCI-20 analyses from Study 5 and the triplicate Sigma UPS 1 analyses from Study 6 (a total of 144 LC-MS/MS analyses from four different LTQs and four different Orbitraps), comprise the essential corpus for our analysis of repeatability and reproducibility.
Bioinformatic variability in Orbitrap data handling
Initially, the MS/MS spectra produced in these studies were identified by database searches configured for low mass accuracy, because we expected significant differences in instrument tuning among the sites. Precursor ions were required to fall within 0.100 m/z of the values expected from peptide sequences (MyriMatch did not support configuring precursor tolerance in ppm units at the time of data analysis). However, we also recognized that higher mass accuracy of Orbitrap instruments could enable tighter mass tolerances for Orbitrap data sets. We tested MyriMatch over a range of precursor tolerances from ±0.004 to ±0.012 m/z () in searches of the Orbitrap analyses of yeast from Study 6. We also included the original ±0.100 m/z setting as well as a tolerance of ±1.250 m/z, centered on the average peptide mass, which we used for searches of LTQ data. We observed an 8–16% increase in identification rate for the tight tolerances versus the original search settings and a 9–15% increase compared to the ±1.250 m/z tolerance used for the LTQ. The data indicate that the high mass accuracy of the Orbitrap produces a moderate increase in identifications, but only when precursor mass tolerance for identification is optimized. Based on these data, we used ±0.007 m/z as the precursor tolerance for all subsequent Orbitrap searches in this paper.
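Because the tolerance was configured in absolute m/z units rather than ppm, its relative size depends on the precursor m/z. The following Python sketch illustrates the conversion; the function name and the example precursor m/z values are hypothetical and are not taken from the CPTAC data.

```python
def mz_tolerance_to_ppm(tolerance_mz: float, precursor_mz: float) -> float:
    """Express an absolute m/z tolerance as parts per million of the precursor m/z."""
    return tolerance_mz / precursor_mz * 1e6


# Hypothetical precursor m/z values, chosen only to show the trend.
for precursor_mz in (400.0, 700.0, 1200.0):
    ppm = mz_tolerance_to_ppm(0.007, precursor_mz)
    print(f"0.007 m/z at m/z {precursor_mz:.0f} is about {ppm:.1f} ppm")
```

A fixed m/z window is therefore proportionally wider for low m/z precursors than for high m/z precursors.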
Other aspects of these instruments can also impact the effectiveness of this peptide identification. We examined the number of peaks recorded per tandem mass spectrum from each of the four instruments as a measure of variability remaining after SOP refinement. Panel 2B reveals significant differences in peak count interquartile range among these instruments. These differences may reflect differences in electron multipliers, low levels of source or instrument contamination, or instrument idiosyncrasies. Many search engines make allowance for high resolution instruments mis-reporting the monoisotope for a precursor ion. We conducted the MyriMatch database search with and without the option to add or subtract one neutron from the computed precursor mass, and Panel 2C shows the extent to which this feature improved identification rates. OrbiP@65 benefited disproportionately from this feature. Of all twelve files, only one replicate from Orbi@86 failed to benefit from allowing the addition or subtraction of a neutron. Differences in peak density per spectrum and monoisotope selection can both influence peptide identification.
Many studies have shown that database search engines produce limited overlap in the peptides they identify, and the comparison shown in Panel 2D is no exception. We repeated the database search for these twelve files in X!Tandem as described in Materials and Methods. The number of identifications produced by X!Tandem typically fell within 3% of the number produced by MyriMatch, but Panel 2D shows an average overlap of only 71% between the peptide sequences from each identification algorithm for each LC-MS/MS experiment. Oddly, the degree of overlap appeared to be higher for the two Orbitraps at site 65 than for the others, though this would not appear to reflect tuning similarities, given . Though bioinformatics pipelines are perfectly repeatable for a given LC-MS/MS file, the choice of search engine and configuration clearly can have a tremendous impact on the identifications produced for a given data set.
Variability and repeatability in identified MS/MS spectra between instruments
depicts the numbers of MS/MS spectra mapped to yeast sequences in Studies 5, 6, and 8. At a glance, suggests that strong instrument-dependent effects are observed in the identifications. The numbers of raw MS/MS spectra produced by LTQ instruments were approximately double the numbers of MS/MS spectra produced by Orbitraps due to the use of the charge state exclusion feature in the Orbitraps (data not shown). The numbers of MS/MS spectra that could be confidently matched to peptide sequences, however, were quite similar between instrument classes. Study 6 was most suggestive of differences in numbers of identifications between LTQs and Orbitraps (), and yet even for this case the p-value (0.058) was insignificant (Student’s t-test using unpaired samples, two-sided outcomes, and unequal variances). While the fraction of raw spectra that were identified successfully was much higher for Orbitraps than for LTQs, the overall numbers of identifications between instrument classes were comparable.
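The test described here (unpaired samples, two-sided, unequal variances) is commonly known as Welch's t-test. As a minimal sketch, the statistic and its approximate degrees of freedom can be computed with the Python standard library; the function name and the input values are illustrative assumptions, not study data. Obtaining the two-sided p-value additionally requires the Student t distribution, which SciPy provides through `scipy.stats.ttest_ind(a, b, equal_var=False)` if that library is available.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t, df) for an unpaired t-test that does not assume equal variances."""
    # Squared standard error contributed by each group.
    se2_a = variance(sample_a) / len(sample_a)
    se2_b = variance(sample_b) / len(sample_b)
    t = (mean(sample_a) - mean(sample_b)) / sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(sample_a) - 1) + se2_b ** 2 / (len(sample_b) - 1)
    )
    return t, df


# Made-up identification counts for two instrument classes.
t_stat, dof = welch_t([4100, 4600, 3900, 4400], [4800, 5200, 3500, 5100])
print(f"t = {t_stat:.3f}, df = {dof:.2f}")
```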
The full set of samples extended beyond the yeast lysate. Table 1 shows the numbers of identified spectra, distinct peptides, and different proteins from each instrument from each of the studies. Spectra have been summed across all replicates for each instrument. Peptides were distinct if they differed in sequence or chemical modifications, but they were considered indistinct if they differed only by precursor charge. Proteins were only counted as identified if they matched multiple distinct peptides for a particular instrument. The data demonstrate that the NCI-20 and Sigma UPS 1 defined mixtures produced an order of magnitude fewer identified spectra and distinct peptides than did the more complex yeast sample. In both of these samples, contaminating proteins (e.g. keratins, trypsin) sometimes caused instruments to appear to have identified more proteins from these defined mixtures than the pure samples contained.
Table 1. Identification counts for all included samples
Laboratories employing LC-MS/MS with data-dependent acquisition of MS/MS spectra expect identification variability. If peptides from a single digestion are separated on the same HPLC column twice, variations in retention times for peptides will alter the particular mix of peptides eluting from that column at a given time. These differences, in turn, impact the observed intensities for peptide ions in MS scans and influence which peptide ions will be selected for fragmentation and tandem mass spectrum collection. Spectra for a particular peptide, in turn, may differ significantly in signal-to-noise, causing some to be misidentified or to be scored so poorly as to be eliminated during protein assembly. All of these factors diminish the expected overlap in peptide identifications among replicate LC-MS/MS analyses.
Because Orbitraps targeted peptides for fragmentation on the basis of higher resolution MS scans than did the LTQs, we asked whether Orbitrap peptide identifications were more repeatable. Repeatability for each instrument was measured by determining the percentage of peptide overlap between each possible pair of technical replicates. The six replicates of yeast in Study 5 enabled fifteen such comparisons for each instrument, while the five replicates of NCI-20 yielded ten comparisons. In Studies 6 and 8, triplicates enabled only three comparisons per instrument (A vs. B, A vs. C, and B vs. C).
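The pairwise comparison can be sketched in Python as follows. The overlap definition used here (shared peptides divided by the union of peptides in the two replicates) is an assumption made for illustration, since the exact formula is given in the Methods rather than in this section; the function name and the peptide sets are hypothetical.

```python
from itertools import combinations
from statistics import median


def pairwise_overlap(replicates):
    """Percent overlap for every unordered pair of replicate peptide sets.

    Assumes overlap = |A & B| / |A | B|; the study's own definition may differ.
    """
    return [100.0 * len(a & b) / len(a | b) for a, b in combinations(replicates, 2)]


# Toy triplicate: three replicates give three comparisons (A-B, A-C, B-C).
triplicate = [{"PEPA", "PEPB", "PEPC"}, {"PEPB", "PEPC", "PEPD"}, {"PEPA", "PEPC"}]
overlaps = pairwise_overlap(triplicate)
print(len(overlaps), "comparisons, median overlap", round(median(overlaps), 1), "%")
```

With n replicates the number of comparisons is n(n-1)/2, which gives the fifteen, ten, and three comparisons cited above for six, five, and three replicates.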
Figure 4 reports the pair-wise repeatability for peptide identifications in Studies 5, 6, and 8. By including six replicates of yeast and five replicates of NCI-20, Study 5 yielded the most information for comparison among technical replicates (panels 4A and 4B); Student’s t-test produced a p-value of 0.035 in comparing LTQs to Orbitraps in Study 5 yeast peptide repeatability (using unpaired samples, two-sided outcomes, and unequal variances). Significant differences were not observed in Study 6 analyses of yeast peptides (p = 0.057, panel 4C); this result is traceable to the low repeatability (39%) observed for the Orbitrap at site 86, a set of runs that also suffered from low sensitivity (). Yeast data from Study 8 (panels 4E and 4F) differentiated instrument classes with a p-value of 0.027. The averages of medians for LTQ yeast peptide repeatability were 36%, 38%, and 44% in Studies 5, 6, and 8, respectively. The corresponding values for Orbitraps were 54%, 47%, and 59%. Orbitrap instruments thus achieved peptide repeatability 9 to 18 percentage points higher than LTQs for the yeast samples.
Identification repeatability and sample complexity
We hypothesized that increased sample complexity would decrease peptide repeatability. A complex mixture such as yeast should yield a more diverse peptide mixture than the 48 proteins of Sigma UPS 1, which in turn should yield a more diverse peptide mixture than the simple NCI-20 sample.
This hypothesis can be tested by returning to . On average, the simple NCI-20 yielded a median 44% overlap in the peptides identified between pairs of replicates (). The Sigma UPS 1 produced almost the same average, a 46% overlap for peptides (). Yeast repeatability fell between the two, giving a 45% median overlap (). The interquartile ranges across all instruments for these three mixtures were 39–50% for NCI-20, 42–48% for Sigma UPS 1, and 36–54% for yeast. These similar overlaps are all the more striking when one considers the numbers of peptides observed for each sample. In total, 977 different peptide sequences (in 7755 spectra) were identified from the NCI-20 (this includes a number from contaminant proteins), and the Sigma UPS 1 generated 1292 peptides from 9550 identified spectra. The yeast, on the other hand, produced 14,969 peptides from 130,268 identified spectra. These data demonstrate that repeatability in peptide identification is independent of the complexity of the sample mixture and is robust against order-of-magnitude changes in the numbers of identifications.
Observing these consistent but low repeatability values conveys a key message: all digestions of protein mixtures are complex at the peptide level. Clearly NCI-20 and yeast are widely separated on a scale of protein complexity. The peptides resulting from both digestions, however, are still diverse enough to preclude comprehensive identification in a single LC-MS/MS. The peptides identified from NCI-20 included numerous semi-tryptic peptides in addition to the canonical tryptic peptides. Likewise, the concentration of a sample may change the number of identified spectra dramatically without changing the fractional overlap between successive replicates. Peptide identification repeatability may be a better metric for judging the particular instruments or analytical separations than for particular samples.
While peptide repeatability was essentially unchanged in response to sample complexity, protein repeatability appeared different for the equimolar Sigma UPS 1 and complex yeast samples (open boxes, ). Student’s t-test, comparing the protein repeatabilities for yeast and Sigma UPS 1 in each instrument, produced p-values below 0.05 for the LTQ@73, LTQx@65, OrbiO@65, and OrbiP@65, but the p-values for LTQ2@95, Orbi@86, and OrbiW@56 were all in excess of 0.10 (unpaired samples, two-sided outcomes, and unequal variances). The wide spread of protein repeatability observed for Sigma UPS 1 in prevented a strong claim that protein repeatability differed between these two samples.
Identification repeatability and sample load
A similar result appears when high concentrations of yeast are compared to low concentrations. Study 8 differs from the others in that each laboratory employed lab-specific protocols rather than a common SOP. , panels E–F display the peptide and protein repeatability for both high and low sample loads of yeast. A five-fold increase in sample on-column increased identified peptide counts an average of 48% per instrument, whereas protein counts increased by an average of 35%. The median peptide repeatability, however, was essentially identical between the low and high loads, both with median values of 53% (with an interquartile range of 43–58% for low load and 44–58% for high load). The repeatability for proteins was always higher than for peptides in Studies 6 and 8, but it, too, was unaffected by protein concentration. The median value for protein repeatability at the low load was 76%, while the median for the high load was 75%. The stability of repeatability between low and high concentrations helps reinforce the findings for Studies 5 and 6, which used the same sample load as the low concentration in Study 8. The 120 ng load was intended to be high enough for good identification rates but low enough to forestall significant carryover between replicates. Though larger numbers of identifications resulted from a higher sample load, repeatability for peptides and proteins was unchanged by the difference in concentration.
Study 8 also provided an opportunity to examine the problem of peptide oversampling in response to sample load. All instruments in this study were configured to employ the “Dynamic Exclusion” feature to reduce the extent to which multiple tandem mass spectra were collected from each ion, though the specific configurations differed. For each replicate in each instrument, we computed the fraction of peptides for which multiple spectra were collected. Several possible reasons can account for the collection of multiple spectra: each precursor ion may appear at multiple charge states, dominant peptides may produce wide chromatographic peaks, or a different isotope of a peptide may be selected for fragmentation. In LTQs, an average of 15.3% of peptides matched to multiple scans for each replicate of the low concentration samples, while this statistic was 18.5% for the high concentration samples. Orbitraps were more robust against collecting redundant identifications, with 8.8% of peptides matching to multiple spectra in the low concentration samples and 11.5% matching to multiples in the high concentration samples. Across all instruments, the five-fold increase in sample concentration corresponded to 2.9% more peptides matching to multiple spectra. This difference produced significant p-values (below 0.05) for all but one of the instruments by t-test (using unpaired samples, allowing for two sided outcomes, and assuming unequal variances). Although more spectra were identifiable when sample concentration increased, the redundancy of these identifications was also higher.
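The redundancy statistic can be illustrated with a short Python sketch. The input format (one identified peptide per confidently matched spectrum) and the function name are assumptions made for illustration; the toy peptide sequences are not study data.

```python
from collections import Counter


def multi_spectrum_fraction(spectrum_peptides):
    """Fraction of distinct peptides that were matched by more than one spectrum.

    `spectrum_peptides` holds one peptide identification per identified spectrum
    and is assumed to be non-empty.
    """
    spectra_per_peptide = Counter(spectrum_peptides)
    repeated = sum(1 for n in spectra_per_peptide.values() if n > 1)
    return repeated / len(spectra_per_peptide)


# Five spectra, four distinct peptides, one of them sampled twice.
replicate = ["AAGK", "AAGK", "LLEDR", "VTNQK", "SSPER"]
print(f"{100 * multi_spectrum_fraction(replicate):.1f}% of peptides matched multiple spectra")
```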
Properties that influence repeatability in peptide identification
Given that peptide-level repeatability rarely approached 60%, it may seem that the appearance of peptides across replicates is largely hit-or-miss. In fact, some peptides are far more repeatable than others. We examined three factors for their correlation with peptide repeatability: trypsin specificity, peptide ion intensity, and protein of origin.
Many laboratories routinely consider only fully-tryptic peptide identifications in database searches, expecting only a minimal amount of nonspecific enzymatic cleavage and in-source fragmentation in their samples. Allowance for semi-tryptic matches (peptides that match trypsin specificity only at one terminus) has been shown to improve identification yield [23, 30]. In Study 6, semi-tryptic peptides constituted 2.2% of the identifications from yeast per instrument and 16.9% of the peptides identified from the Sigma UPS 1 sample. In both samples, semi-tryptic peptides were less likely to appear in multiple replicates than fully tryptic peptides (). In yeast, an average of 45% of the fully tryptic peptides appeared in only one replicate, but 62% of the semi-tryptic peptides appeared in only one replicate. Comparing percentages by instrument produced a p-value of 0.000763 by paired, two-sided t-test. A similar trend appeared in Sigma UPS 1, with 38% of fully tryptic peptides appearing in only one replicate and 65% of semi-tryptic peptides appearing in only one replicate (p=0.00014). Although these two samples produced different percentages of semi-tryptic peptide identifications, both results are evidence that semi-tryptics are less repeatable in discovery experiments.
Precursor ion intensity drives selection for MS/MS and would be expected to correlate with repeatability of peptide identifications. We analyzed the five replicates of NCI-20 from Study 5 to measure this relationship. Peptides from each instrument were separated into classes by the number of replicates in which they were identified, and the maximum intensity of each tryptic precursor ion was computed from the MS scans (see Methods). Box plots were used to summarize the results (see ). For both the LTQs and Orbitraps, repeatability positively correlated with precursor intensity, with peptides identified in only one replicate yielding much lower intensities. We also observed that low intensity peptides were less reproducible across instruments (data not shown). As expected, intense ions are more consistently identified from MS/MS scans.
The protein of origin may also impact the repeatability of a peptide. Intuitively, a peptide that results from the digestion of a major protein is more likely to repeat across replicates than a peptide from a minor protein. In this analysis, the overall number of distinct peptide sequences observed for each protein was used to segregate them into seven classes. Proteins with only one peptide observed comprised the most minor class. The number of peptides required for each successive higher class was doubled, yielding classes of proteins identified by 2 peptides, 3–4 peptides, 5–8 peptides, 9–16 peptides, and 17–32 peptides. Proteins with more than 32 peptides were classed as major proteins. For each instrument in Study 5, observed yeast peptides were split into classes by the number of replicates (out of six) in which they appeared. The graphs in show how protein class corresponds to peptide repeatability.
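The seven-class binning described above can be written compactly in Python. This is a minimal sketch; the function name and the class labels are chosen here for illustration.

```python
from bisect import bisect_left

# Upper bounds of the six bounded classes; anything above 32 is a "major" protein.
UPPER_BOUNDS = [1, 2, 4, 8, 16, 32]
LABELS = ["1", "2", "3-4", "5-8", "9-16", "17-32", ">32"]


def protein_class(distinct_peptides: int) -> str:
    """Map a protein's count of distinct peptide sequences to its class label."""
    if distinct_peptides < 1:
        raise ValueError("an identified protein has at least one peptide")
    return LABELS[bisect_left(UPPER_BOUNDS, distinct_peptides)]


for count in (1, 2, 3, 8, 9, 32, 33):
    print(count, "->", protein_class(count))
```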
On average, 40% of peptides appearing in all six replicates matched to the major proteins (more than 32 distinct peptide sequences). At the other extreme, peptides appearing in only one replicate matched to major proteins only 18% of the time. Almost none of the peptides that were sole evidence for a protein were observed in all six replicates, but 13% of the peptides observed in only one replicate were of this class, and only 3% of the peptides observed in two of the six replicates were sole evidence for a protein. These data are consistent with a model in which digestion of any protein produces peptides with both high and low probabilities of detection; peptides for a major protein may be observed in the first analysis of the sample, but additional peptides from the same protein will be identified in subsequent analyses. The highest probability peptides from minor proteins must compete for detection with less detectable peptides from major proteins.
Factors governing peptide and protein identification reproducibility
The analyses of repeatability described above establish the level of variation among technical replicates, but they do not address the reproducibility of shotgun proteomics across multiple laboratories and instruments. What degree of variability should be expected for analyses of the same samples across multiple instruments? Studies 6 and 8 provide data to address this question. Study 6 was conducted under SOP version 2.2, with a comprehensive set of specified components, operating procedures, and instrument settings shared across laboratories and instruments. Study 8, on the other hand, was performed with substantial variations in chromatography (e.g. different column diameters, self-packed or monolithic columns and various gradients) and instrument configuration (e.g. varied dynamic exclusion, injection times, and MS/MS acquisitions). See Supplementary Information for additional detail.
compares the yeast peptides and proteins identified from each replicate LC-MS/MS analysis on one instrument to all other analyses on other instruments of the same type in the same study. The peptide median of 30% for Study 6 LTQs, for example, indicates that typically 30% of the peptides identified from a single analysis on one LTQ were also identified in individual analyses on the other LTQs. As observed previously in repeatability, protein reproducibility was always higher than the corresponding peptide reproducibility. separates the Orbitrap at site 86 from the others in Study 6 because it was a clear outlier; all comparisons including this instrument yielded lower reproducibility than comparisons that excluded it. The remaining three Orbitraps in Study 6 were cross-compared to produce the Study 6 Orbitrap evaluation.
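A cross-instrument comparison of this kind can be sketched in Python. Following the wording of the LTQ example, the sketch assumes that reproducibility is the percentage of peptides from one analysis that also appear in an analysis from a different instrument; this directional definition, the function name, and the toy data are assumptions and may differ from the definition in the Methods.

```python
def cross_instrument_overlaps(runs_by_instrument):
    """Percent of each analysis's peptides also found in each analysis on other instruments.

    `runs_by_instrument` maps an instrument name to a list of non-empty peptide sets,
    one set per LC-MS/MS analysis. Analyses from the same instrument are never compared.
    """
    overlaps = []
    for name_a, runs_a in runs_by_instrument.items():
        for name_b, runs_b in runs_by_instrument.items():
            if name_a == name_b:
                continue
            for run_a in runs_a:
                for run_b in runs_b:
                    overlaps.append(100.0 * len(run_a & run_b) / len(run_a))
    return overlaps


# Hypothetical instruments and peptides.
toy = {
    "instrument_X": [{"PEPA", "PEPB"}, {"PEPA", "PEPC"}],
    "instrument_Y": [{"PEPA", "PEPB", "PEPD", "PEPE"}],
}
print(cross_instrument_overlaps(toy))
```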
Surprisingly, reproducibility was essentially the same with and without an SOP. From Study 6 to Study 8, the median reproducibility increased by only 0.8% for LTQs and by 0.4% for Orbitraps, changes that were not significant. The inter-quartile range of reproducibility increased slightly for proteins (both LTQ and Orbitrap) and for peptides observed by Orbitraps. Thus, while the median reproducibility was unaffected by the SOP, the range of reproducibility broadened when no SOP was employed. This test of reproducibility is limited in scope to the identifications resulting from these data; an examination of retention time reproducibility or elution order might reveal considerably more detail about the reproducibility gains achieved through this SOP. Because Study 8 followed the experiments incorporating the SOP, lab-specific protocols may have been altered to incorporate some elements of the SOP, thus diminishing any observable effect.
The comparison between LTQ and Orbitrap platforms shows two contrary phenomena. First, the Orbitrap at site 86 shows a potential disadvantage of these instruments; an Orbitrap that is not in top form can produce peptide identifications that do not reproduce well on other instruments. When this instrument is excluded, however, the remaining Orbitrap analyses were more reproducible at both the peptide and protein levels than were analyses on LTQ instruments. The difference in median reproducibility by instrument class ranged between 5.2% and 6.6%, depending on which study was analyzed and whether peptides or proteins were examined. Although Orbitraps can produce more reproducible identifications than LTQs, the difference is not large, and Orbitraps that are not operating in peak form lack this advantage.
Impact of repeatability on discrimination of proteomic differences
One of the most important applications of LC-MS/MS-based shotgun proteomics is to identify protein differences between different biological states. This approach is exemplified by the analysis of proteomes from tissue or biofluid samples that represent disease states, such as tumor tissues and normal controls. These studies typically compare multiple technical replicates for one sample type to technical replicates for a second sample type. Identified proteins that differ in spectral counts or intensities between samples may be considered biomarker candidates, which may be further evaluated in a biomarker development pipeline [31]. Repeatability of the analyses is a critical determinant of the ability to distinguish true differences between sample types.
The repeatability analyses described above revealed that approximately half the peptides observed in LC-MS/MS analysis will be identified in a second replicate. This low degree of overlap implies that the total number of peptides identified grows rapidly for the first few replicates and then slows. A plot of this growth for the six yeast replicates and five NCI-20 replicates of Study 5 is presented in . The rates at which newly-identified peptides accumulated for NCI-20 and yeast replicates were indistinguishable. The first two replicates for NCI-20 identified an average of 41% more peptides per instrument than did the first replicate alone. Similarly, two replicates for yeast contributed an average of 38% more peptides than a single analysis. Three replicates identified 59% more peptides for NCI-20 than did the first replicate alone, whereas three replicates for yeast identified 63% more peptides than one replicate. The similarity of these trends implies that one cannot choose an appropriate number of replicates based on sample protein complexity alone. By collecting triplicates, researchers will, at a minimum, be able to estimate the extent to which new replicates improve the sampling of these complex mixtures.
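The cumulative growth described in this paragraph can be computed as in the following Python sketch. The function name and the toy peptide sets are illustrative assumptions; they are not the Study 5 data.

```python
def cumulative_gain(replicates):
    """Percent increase in distinct peptides after each replicate, relative to the first.

    `replicates` is an ordered list of peptide sets; the first set must be non-empty.
    """
    seen = set()
    totals = []
    for peptides in replicates:
        seen |= set(peptides)
        totals.append(len(seen))
    first = totals[0]
    return [100.0 * (total - first) / first for total in totals]


# Toy data: 4 peptides in the first run, 5 after two runs, 7 after three runs.
runs = [{"A", "B", "C", "D"}, {"A", "B", "E"}, {"A", "F", "G"}]
print(cumulative_gain(runs))  # [0.0, 25.0, 75.0]
```

Applied to real triplicates, the second and third entries correspond to the "more peptides than the first replicate alone" percentages quoted above.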
In this analysis, independent technical replicates were evaluated for cumulative identifications. If maximizing distinct peptide identifications were the priority, however, one might instead minimize repeated identifications among replicates to as great an extent as possible. Some instrument vendors have implemented features to force subsequent replicates to avoid the ions sampled for tandem mass spectrometry in an initial run (such as RepeatIDA from Applied Biosystems). Researchers have also used accurate mass exclusion lists [32] or customized instrument method generation [33] to reduce the re-sampling of peptides in repeated LC-MS/MS analysis. The use of such strategies would lead to reduced repeatability for peptide ions and steeper gains in identifications from replicate to replicate.
Discrimination of proteomic differences between samples based on spectral count data requires relatively low variability in counts observed across replicates of the same samples. To characterize spectral count variation, we examined the counts observed for each protein across six replicates of yeast for each instrument in Study 5. Coefficients of variation (CVs), which compare the standard deviation to the mean for these counts, have two significant drawbacks for this purpose. First, they do not take into account the variation in overall sensitivity for some replicates (as illustrated by the spread of data points for each instrument in ). Second, CVs are generally highest for proteins that produce low spectral counts. We developed a statistic based on the multivariate hypergeometric (MVH) distribution instead (see Methods) that addressed both of these problems. In this approach, we compute the ratio between two probabilities. The first is the probability that spectral counts for a given protein would be distributed across replicates as actually observed, given the number of all identified spectra in each replicate. The second is the probability associated with the most likely distribution of these spectral counts across replicates. The ratio, expressed in natural log scale, asks how much less likely a particular distribution is than the most common distribution of spectral counts.
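One possible reading of this statistic is sketched below in Python. The exact formulation is given in the Methods and is not reproduced in this section, so the sketch rests on stated assumptions: the probability of observing counts k_i for a protein in replicates with N_i identified spectra is taken as the multivariate hypergeometric probability, the product of C(N_i, k_i) divided by C(sum N_i, sum k_i), and the most likely distribution is found by greedy allocation, which is valid because the log binomial coefficient is concave in k. All function names and example numbers are hypothetical.

```python
from math import exp, lgamma


def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def log_mvh(counts, totals):
    """Log probability of drawing counts[i] of a protein's spectra from replicate i."""
    return sum(log_comb(n, k) for n, k in zip(totals, counts)) - log_comb(
        sum(totals), sum(counts)
    )


def most_likely_counts(total_count, totals):
    """Greedy search for the most probable split of total_count across replicates."""
    counts = [0] * len(totals)
    for _ in range(total_count):
        # Adding one spectrum to replicate i multiplies the probability by (N_i - k_i) / (k_i + 1).
        best = max(range(len(totals)), key=lambda i: (totals[i] - counts[i]) / (counts[i] + 1))
        counts[best] += 1
    return counts


def mvh_log_ratio(counts, totals):
    """ln(P(most likely split) / P(observed split)); 0 means perfectly typical counts."""
    mode = most_likely_counts(sum(counts), totals)
    return log_mvh(mode, totals) - log_mvh(counts, totals)


# Hypothetical protein observed in six replicates with 5000 identified spectra each.
score = mvh_log_ratio([12, 11, 10, 9, 3, 2], [5000] * 6)
print(f"log ratio = {score:.2f}, i.e. about {exp(score):.0f} times less likely than the mode")
```

Because the score conditions on the number of identified spectra in each replicate, a replicate with low overall sensitivity is expected to contribute fewer counts and is not penalized for doing so.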
The results, illustrated in and and , show considerable spread in the probability ratios for each instrument. The highest value (i.e. least stable spectral counts) for a protein in LTQ@73, for example, is 5.29 on a natural log scale, indicating that the observed spectral count distribution for this protein was approximately 200 times less likely than the most equitable distribution of spectral counts (mostly due to the low spectral counts observed in replicates 5 and 6). The proteins with extreme MVH scores for each instrument reflect that examining spectral counts for large numbers of proteins invariably reveals a set of proteins with uneven spectral counts, even when the sample is unchanged. This phenomenon reflects the need for multiple-testing correction in the statistical techniques employed for spectral count differentiation.
Table 2. Study 5 yeast protein extreme spectral count variation
Table 3. Study 5 yeast protein median spectral count variation
The median cases for these scores are examples of routine disparities in spectral counts. The medians for Orbitrap instruments were approximately half of the LTQ medians (Student’s t-Test p-value=0.00617, using unpaired samples, two-sided outcomes, and unequal variances); spectral counts vary more in LTQ instruments. As a result, biomarker studies employing LTQ spectrum counts need to include more replicates than those employing Orbitraps to achieve the same degree of differentiation.