Targeted proteomics is a powerful approach that enables quantitative analysis of tryptic peptides from complex biological samples with high sensitivity and specificity1,2. However, a major bottleneck limiting wider application of targeted proteomics has been the identification of optimal proteotypic peptides that are readily detectable by the mass spectrometer, as well as the characteristic fragmentation patterns of these peptides.
Because of differences in physicochemical properties, different peptides from the same protein can produce drastically different signal intensities when measured with a mass spectrometer1. Peptides are referred to as ‘proteotypic’ if they (i) are unique to a given protein, (ii) have good response characteristics in the mass spectrometer, and (iii) have a fragmentation pattern with salient features that allow accurate detection and quantification. Traditional strategies for identifying proteotypic peptides and their fragmentation patterns have relied on the combination of experimental data with bioinformatic analyses. A common approach has been to use peptides catalogued in the course of ‘shotgun’ proteomic experiments conducted by data-dependent acquisition3,4. This approach assumes that the peptides most frequently identified in shotgun experiments will produce the best response in a targeted proteomics setting. This assumption also underlies the application of machine learning methods, which aim to predict proteotypic peptides (but not their fragmentation spectra) de novo5,6. Complicating these efforts, a large subset of the human proteome is absent from fragmentation spectra databases, and this deficit is particularly acute for low-abundance proteins such as transcription factors and kinases. To generate such peptide fragmentation data, large-scale efforts aim to synthesize predicted proteotypic peptides and empirically determine their fragmentation patterns7. However, which, if any, of these approaches is best suited for sensitive targeted proteomic analyses is unknown.
Here we report an empirically driven approach for generating both optimal proteotypic peptides and their fragmentation patterns in a scalable, economical, and generalizable fashion. Rather than relying on sparsely populated spectral databases3,4, prediction algorithms5,6, costly peptide synthesis7 or the costly purchase of full-length proteins8, we leveraged the rich collection of tagged cDNA clones that are currently available for most human and model organism proteins9,10 to generate in vitro-synthesized full-length protein samples, followed by tryptic digestion and mass spectrometry analysis using selected reaction monitoring (SRM). Because all monitored tryptic peptides for each protein originate from the same full-length protein molecules, we are able to compare the relative intensities of different peptides to identify those that provide the most sensitive proxy for the target protein. In addition to the relative peptide response, we are able to identify in parallel the fragmentation patterns of these peptides in a triple-quadrupole mass spectrometer using SRM ().
Development of targeted proteomics assays using enriched in vitro synthesized full length proteins
To demonstrate our approach, we studied transcription factors, a diverse class of low-abundance proteins with a paucity of spectral data in public databases (Supplementary Fig. 1). We selected 96 human transcription factor proteins spanning all major structural families11 (). For each of these proteins, we obtained full-length cDNA clones contained within an in vitro transcription/translation compatible vector with an in-frame C-terminal Schistosoma japonicum glutathione S-transferase (GST) tag12 (Supplementary Data 1). We then optimized in vitro protein production and purification in a 96-well plate format. We tested different protein production, capture, wash and digestion conditions to develop a protocol that gave maximal protein yield at the highest possible purity (Methods). To verify that enriched full-length proteins were produced, we performed silver-staining and western blotting analyses for 46 of the 96 proteins ( and Supplementary Fig. 2). For nearly all of the tested proteins, the target protein and the two endogenous glutathione-binding proteins GSTM3 and EEF1G were the three most intense bands on silver staining, indicating that SRM signal contamination should be minimal. In total, 96% (44/46) of the tested clones produced highly enriched proteins with the correct molecular weight. The remaining two samples produced multiple species of different molecular weights, likely originating from alternative methionine start codons.
For each protein, we selected peptides and fragment ions to measure using the software package Skyline13, an open-source application for building SRM methods and analyzing the resulting mass spectrometry data. We focused our analysis on predicted fully tryptic peptides with lengths between 7 and 23 amino acids. For each doubly charged monoisotopic precursor, we monitored singly charged monoisotopic y3 product ions using a TSQ-Vantage triple-quadrupole mass spectrometer. These measurements were imported into Skyline to identify the relative peptide responses and their fragmentation patterns (). An annotated Skyline file containing the measured peptides and fragment ions for all 96 proteins can be found at http://proteome.gs.washington.edu/supplementary_data/IVT_SRM/
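The peptide and transition selection described above can be illustrated with a minimal Python sketch. It is not the Skyline implementation. It assumes the common trypsin rule (cleave after K or R unless followed by P), no missed cleavages, and standard monoisotopic residue masses. The function names, the toy input sequence, and the default range of y ions (y3 up to y(n-1)) are assumptions made for illustration; the text above names only y3.

```python
# Standard monoisotopic residue masses in Da.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # monoisotopic mass of H2O
PROTON = 1.007276   # mass of a proton


def tryptic_peptides(sequence, min_len=7, max_len=23):
    """Return fully tryptic peptides (no missed cleavages) in the length window.

    Assumption: cleavage occurs after K or R, except when the next residue is P.
    """
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        is_last = i == len(sequence) - 1
        if is_last or (aa in "KR" and sequence[i + 1] != "P"):
            peptide = sequence[start:i + 1]
            if min_len <= len(peptide) <= max_len:
                peptides.append(peptide)
            start = i + 1
    return peptides


def precursor_mz(peptide, charge=2):
    """Monoisotopic m/z of the intact peptide at the given charge state."""
    neutral_mass = sum(RESIDUE_MASS[aa] for aa in peptide) + WATER
    return (neutral_mass + charge * PROTON) / charge


def y_ion_mz(peptide, n):
    """Monoisotopic m/z of the singly charged y ion holding the last n residues."""
    return sum(RESIDUE_MASS[aa] for aa in peptide[-n:]) + WATER + PROTON


def transitions(peptide, n_min=3, n_max=None):
    """List (precursor m/z, ion label, product m/z) for y ions n_min..n_max.

    Assumption: n_max defaults to len(peptide) - 1.
    """
    if n_max is None:
        n_max = len(peptide) - 1
    q1 = precursor_mz(peptide, charge=2)
    return [(q1, f"y{n}", y_ion_mz(peptide, n)) for n in range(n_min, n_max + 1)]


# Toy sequence (hypothetical) embedding the two GST peptides named in the text.
toy_sequence = "MKLLLEYLEEKIEAIPQIDKGG"
for pep in tryptic_peptides(toy_sequence):
    for q1, label, q3 in transitions(pep):
        print(f"{pep}\t{q1:.4f}\t{label}\t{q3:.4f}")
```

Peptides shorter than 7 residues ("MK" and "GG" in the toy sequence) fall outside the length window and are skipped, which mirrors the 7 to 23 amino acid filter described above.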
To quantify the amount of each protein synthesized, heavy forms of the schistosomal GST peptides LLLEYLEEK and IEAIPQIDK were spiked into each in vitro synthesis reaction. The light-to-heavy ratio of these two peptides was measured, and this ratio was calibrated against an absolute quantification curve generated from samples containing the same amount of the heavy peptides but different known quantities of the light peptide (Supplementary Fig. 3 and Supplementary Note). Using this approach we determined that all of the 96 tested proteins produced at least 0.5 nM of product ().
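The calibration step can be sketched as an ordinary least-squares line relating the measured light-to-heavy ratio to the known light-peptide amount, which is then inverted to estimate the amount in an unknown sample. This is an illustrative sketch only: the actual curve fitting is described in the Supplementary Note, and the linear model, function names, and calibration values below are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def light_amount_from_ratio(ratio, slope, intercept):
    """Invert the calibration line to estimate the light-peptide amount."""
    return (ratio - intercept) / slope


# Hypothetical calibration points: known light amounts (arbitrary units)
# measured against a constant heavy-peptide spike.
known_light = [1.0, 2.0, 4.0, 8.0]
measured_ratio = [0.5, 1.0, 2.0, 4.0]

slope, intercept = fit_line(known_light, measured_ratio)
estimate = light_amount_from_ratio(1.5, slope, intercept)
print(f"slope={slope:.3f} intercept={intercept:.3f} estimate={estimate:.3f}")
```

Because the heavy-peptide amount is constant across calibration points, the ratio depends only on the light-peptide amount, which is what makes the inversion meaningful.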
Targeted assays can be efficiently developed using in vitro synthesized proteins and applied to measure proteins in vivo
Chromatographic data from each peptide were manually analyzed to determine the quality of the peptide signal. Each peptide was given a quality score between 1 and 4, with 1 being the highest quality (Methods). Only peptides with a quality score of either 1 or 2 were considered for further analysis. On average we were able to identify eight peptides per protein with a quality score of 1 or 2. Additionally, all but two of the proteins assayed had at least one peptide with a quality score of 1 or 2 ( and Supplementary Data 1). Of note, although sufficient quantities of both CEBPG and HMGA1 protein were produced using our in vitro approach (Supplementary Fig. 2) and the proteins were sufficiently digested as indicated by the mass spectrometry responses of the GST peptides, none of the monitored tryptic peptides from these two proteins gave a good response in the mass spectrometer. This suggests that a small minority of transcription factor proteins may not be amenable to proteomic analysis using trypsin-based digestion.
To determine the quality of our fragmentation patterns, we compared our observed peptide fragmentation patterns with those contained in the National Institute of Standards and Technology (NIST) spectral database. Of the 760 peptides in our dataset with a quality score of either 1 or 2, only 18% (136) were represented in the NIST database (Methods). Of these, all had high spectral similarity scores, with 93% having dot-products greater than 0.85 (Supplementary Fig. 4). This finding corroborates both our data and the NIST database and further highlights the scarcity of proteotypic peptides within large spectral databases.
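A spectral dot-product of the kind reported above can be sketched as the cosine of the angle between two fragment-intensity vectors. The exact scoring used is described in the Methods; the version below is a generic normalized dot-product, and the function name and example intensities are hypothetical.

```python
import math


def normalized_dot_product(spectrum_a, spectrum_b):
    """Cosine similarity between two spectra given as {ion label: intensity}.

    Ions missing from one spectrum are treated as having zero intensity.
    """
    ions = set(spectrum_a) | set(spectrum_b)
    dot = sum(spectrum_a.get(i, 0.0) * spectrum_b.get(i, 0.0) for i in ions)
    norm_a = math.sqrt(sum(v * v for v in spectrum_a.values()))
    norm_b = math.sqrt(sum(v * v for v in spectrum_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical fragment intensities for one peptide: SRM-derived vs. library.
srm_pattern = {"y3": 120.0, "y4": 800.0, "y5": 450.0, "y6": 90.0}
library_pattern = {"y3": 100.0, "y4": 1000.0, "y5": 500.0, "y6": 50.0}

score = normalized_dot_product(srm_pattern, library_pattern)
print(f"dot-product = {score:.3f}")
```

The score is insensitive to overall signal scale, so it compares only the relative fragment intensities, which is the property that makes patterns from different instruments or sample amounts comparable.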
We next determined the utility of predictor algorithms and shotgun analyses to identify optimal proteotypic peptides. A comparison of our empirical ranking of proteotypic peptides with peptide rank predictions from the ESPPredictor algorithm6 revealed Spearman correlations ranging from −0.45 to 0.85 with an average correlation of 0.47 (Supplementary Data 2 and Supplementary Fig. 5). Similarly, roughly half of the optimal proteotypic peptides from our experiments were undetected by shotgun analyses of the identical samples (Supplementary Fig. 6). While these approaches are better than selecting proteotypic peptides at random, our results suggest that current predictor algorithms and spectral counting approaches provide imperfect ranking and identification of optimal proteotypic peptides, potentially limiting the utility of large-scale peptide synthesis efforts that rely on such approaches as a first-round filter7.
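The per-protein rank comparison can be illustrated with a small Spearman correlation sketch: rank the peptides of one protein by measured intensity and by predicted score, then compute the Pearson correlation of the two rank vectors. The helper names and the example values are hypothetical, and the handling of ties (average ranks) is an assumption rather than a detail taken from the text.

```python
import math


def average_ranks(values):
    """Rank values from 1 (smallest) upward, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical values for five peptides of one protein.
measured_intensity = [9.1e5, 3.2e5, 7.7e4, 5.0e4, 1.2e4]
predicted_score = [0.81, 0.40, 0.66, 0.12, 0.30]

print(f"Spearman correlation = {spearman(measured_intensity, predicted_score):.2f}")
```

A rank-based measure suits this comparison because the question is whether the predictor orders peptides correctly, not whether its scores are proportional to the measured intensities.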
Finally, we sought to confirm the utility of proteotypic peptides identified using our approach for in vivo analyses, and to determine how the in vitro-derived intensity rankings compared with those from complex biological samples. To test this, we first monitored all 12 of the quality score 1 and 2 peptides from the genomic master regulatory transcription factor CTCF in trypsin-digested nuclear lysate from erythroleukemia cells (K562). Using the fragmentation patterns identified in vitro, we identified corresponding chromatographic peaks for six of these CTCF peptides in K562 nuclear extract (). The relative intensity of these peptides in vitro and in vivo closely matched, confirming the relevance of the rank order of peptides identified empirically using in vitro-synthesized protein (). Next, we selected top-ranking peptides from four transcription factors and used these to generate nuclear abundance measurements of these factors across four distinct cell types (). The relative abundance measurements are consistent with previous reports on the tissue distribution of these transcription factors using RNA abundance14,15.
In summary, we demonstrate and validate a rapid and cost-efficient method for empirical identification of optimal proteotypic peptides and their fragmentation patterns using in vitro-synthesized proteins. Our method can be readily applied to generate assays to identify and quantify structurally diverse low-abundance proteins, such as human transcription factors, in unfractionated cellular extracts.