Defining empiric peptide classification training set
The National Cancer Institute Clinical Proteomic Technology Assessment in Cancer Program (NCI-CPTAC) prepared a tryptic digest of a yeast lysate sample and sent it to three proteomic laboratories: Vanderbilt University, New York University (NYU) and the Broad Institute. All laboratories were expected to follow the same MS protocol on an LTQ-Orbitrap mass spectrometer. Vanderbilt analyzed the sample in duplicate on two instruments, NYU analyzed the sample in duplicate, and the Broad Institute performed six replicates. Thus, the yeast lysate was analyzed 12 times across four LTQ-Orbitraps. The raw files were searched using Spectrum Mill v3.4 beta with a precursor mass tolerance of 0.05 Da and fragment mass tolerance of 0.7 Da, specifying up to two missed cleavages and the following modifications: cysteine carbamidomethylation, carbamylation of N termini and lysine, oxidized methionine and pyroglutamic acid. The tandem MS (MS/MS) data were autovalidated at the protein level with a protein score of 25 and at the peptide level using a score of 13, percent similarity of 70%, forward-reverse score of 2, and rank 1-2 score difference of 2, for all charge states. In total, 4,230 peptides (570 proteins) were identified. The peptide identities, m/z, and retention time were exported to calculate the XIC for the monoisotopic peak.
The XIC for each peptide (in a given charge state) was calculated by determining the location (m/z and retention time) of the peptide peak. If a peptide was sequenced multiple times (that is, it has many MS/MS spectra), the identification with the best Spectrum Mill score on a per-charge basis was used for this purpose, as a higher score indicates higher confidence that the fragment spectrum matches the peptide sequence.
In each LC-MS/MS run, different sets of peptides were sequence identified owing to the stochastic behavior of the mass spectrometer. Therefore, retention times were propagated across different LC-MS/MS runs using a quadratic regression model (R2 = 0.99 for all LC-MS/MS runs). This yielded an approximate elution time, and allowed us to ‘hunt’ for peptides not sequence identified in a particular LC-MS/MS run. The XIC was calculated using a combination of retention time and m/z for each peptide.
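The retention-time propagation described above can be sketched as a least-squares quadratic fit between the elution times of peptides identified in both runs. This is an illustrative stand-in (the paper does not specify its regression implementation); `fit_quadratic` and `propagate_rt` are hypothetical names, and the fit is solved here with plain normal equations rather than a statistics package.

```python
def fit_quadratic(x, y):
    """Least-squares fit of y = a*x^2 + b*x + c via the 3x3 normal equations."""
    s = [sum(xi**k for xi in x) for k in range(5)]                    # sums of x^0..x^4
    t = [sum((xi**k) * yi for xi, yi in zip(x, y)) for k in range(3)]  # sums of y*x^0..x^2
    A = [[s[4], s[3], s[2]],
         [s[3], s[2], s[1]],
         [s[2], s[1], s[0]]]
    b = [t[2], t[1], t[0]]
    # Gaussian elimination with partial pivoting
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, 3):
            f = A[r][col] / A[col][col]
            for c in range(col, 3):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    beta = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):  # back substitution
        beta[r] = (b[r] - sum(A[r][c] * beta[c] for c in range(r + 1, 3))) / A[r][r]
    return beta  # (a, b, c)

def propagate_rt(rt_run1, a, b, c):
    """Map a retention time from one run onto another run's time scale."""
    return a * rt_run1**2 + b * rt_run1 + c
```

With a fit of this kind (R2 = 0.99 in the paper), a peptide sequenced only in run 1 gets an approximate elution time in run 2, narrowing the window in which to 'hunt' for its XIC.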
An in-house program was developed to automatically calculate the XIC using the Thermo Software Development Kit. The XIC was calculated using a retention time tolerance of ± 2.5 min and m/z tolerance of ± 15 p.p.m. A summary table was created where the response for each peptide was obtained by summing the XIC values for all peptide variations (that is, peptides with multiple charge states and common modifications). This reduced the list to 3,637 peptides.
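The windowed XIC extraction and per-peptide summing could be sketched as follows. This is a minimal stand-in for the in-house Thermo SDK program, assuming a centroided peak list of (retention time in minutes, m/z, intensity) tuples; the function names are hypothetical.

```python
RT_TOL_MIN = 2.5   # retention time tolerance (minutes)
MZ_TOL_PPM = 15.0  # m/z tolerance (parts per million)

def xic(peaks, target_rt, target_mz, rt_tol=RT_TOL_MIN, ppm_tol=MZ_TOL_PPM):
    """Sum intensities of peaks inside the RT and ppm windows.

    `peaks` is an iterable of (rt_minutes, mz, intensity) tuples, a
    simplified stand-in for raw-file access via a vendor SDK.
    """
    mz_tol = target_mz * ppm_tol / 1e6
    return sum(inten for rt, mz, inten in peaks
               if abs(rt - target_rt) <= rt_tol
               and abs(mz - target_mz) <= mz_tol)

def peptide_response(xics_by_form):
    """Total response: sum the XICs over all charge states and
    commonly modified forms of one peptide."""
    return sum(xics_by_form.values())
```

At m/z 500, the ±15 p.p.m. window corresponds to ±0.0075 Th, so nearby isobaric peaks outside that window are excluded from the sum.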
The yeast LC-MS/MS runs from each institute (Vanderbilt, NYU, Broad Institute) were then median normalized to account for any instrument or processing differences (which were expected to be minor because all samples were processed following the same protocol). The median normalization divides each LC-MS/MS run by its median XIC value and then multiplies it by the common median XIC (the median of the median of all 12 LC-MS/MS runs). A table of identified peptides was created, with their corresponding XIC (if present) in all 12 LC-MS/MS runs. The median of all 12 LC-MS/MS runs was selected as the ‘official’ XIC value for each peptide. Peptides with a coefficient of variance (s.d./mean×100%) >100% were rejected. In addition, any peptide with a median XIC of zero was rejected, indicating that it was not reliably detected in all LC-MS/MS runs.
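The median normalization step described above (divide each run by its own median XIC, then rescale by the median of the per-run medians) can be written compactly. This sketch assumes each run is a peptide-to-XIC mapping; the function name is illustrative.

```python
import statistics

def median_normalize(runs):
    """Median-normalize XIC values across LC-MS/MS runs.

    `runs` maps run name -> {peptide: XIC}.  Each run is divided by
    its own median XIC and multiplied by the common median (the
    median of the per-run medians).
    """
    run_medians = {name: statistics.median(xics.values())
                   for name, xics in runs.items()}
    common = statistics.median(run_medians.values())
    return {name: {pep: x / run_medians[name] * common
                   for pep, x in xics.items()}
            for name, xics in runs.items()}
```

After normalization, a peptide at its run's median intensity takes the same value in every run, so instrument-to-instrument intensity offsets no longer dominate the per-peptide medians.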
Next, a set of peptides ‘not detected’ in the mass spectrometer was created. An in silico tryptic digest was performed for all sequence-identified proteins. A substring search was used to remove any in silico peptide where we had evidence of a sequence-identified peptide. For example, if the in silico peptide was LQTISALPK and the sequence-identified peptide was LQTISALPKGDELR, the in silico peptide was rejected because it is a substring of the sequence-identified peptide. Thus, the ‘not detected’ set of peptides was not seen in any form of the sequence-identified peptides. In addition, any peptide sequence that was not unique and any N- or C-terminal peptides (~4% of the peptides) were removed. The final peptide set contained a list of sequence-identified peptides (with their corresponding XIC) and peptides that were not sequence identified in any form.
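The in silico digest and substring screen could be sketched as below. The cleavage rule (after K or R, not before P) and the zero-missed-cleavage default are assumptions for illustration; the paper's digest settings for this step are not fully specified, and both substring directions are treated as 'evidence' of the identified peptide.

```python
import re

def tryptic_digest(protein_seq, max_missed=0):
    """In silico tryptic digest: cleave after K or R, except before P."""
    pieces = [p for p in re.split(r'(?<=[KR])(?!P)', protein_seq) if p]
    peptides = []
    for i in range(len(pieces)):
        for j in range(i, min(i + max_missed + 1, len(pieces))):
            peptides.append(''.join(pieces[i:j + 1]))
    return peptides

def not_detected_set(in_silico, identified):
    """Keep in silico peptides for which there is no evidence among the
    sequence-identified peptides (neither contains the other)."""
    return [p for p in in_silico
            if not any(p in q or q in p for q in identified)]
```

For example, the in silico peptide LQTISALPK is rejected because it is a substring of the identified peptide LQTISALPKGDELR.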
To classify peptides as high- or low-responding, we considered only proteins with seven or more sequence-identified peptides. The peptide response within each protein was log transformed (excluding peptides 'not detected') to create a normal distribution, as justified by the Box-Cox transformation33. The log-transformed data were then standardized (using the z-score) within each protein. High-responding peptides were selected with a z-score ≥ 0, whereas low-responding peptides were selected with a z-score ≤ −1. This procedure was used only to create the training set and does not apply to the validation sets, where we examined only the five highest-responding peptides. The 'not detected' peptides were then appended to the low-responding peptides to create a binary high (n = 623) versus low/not detected (n = 2,530) classifier.
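The within-protein labeling rule above can be sketched directly. This is an illustrative implementation, assuming the sample standard deviation for the z-score; peptides falling strictly between the two thresholds receive no label and are excluded from the training set.

```python
import math
import statistics

def label_peptides(responses, not_detected):
    """Label one protein's peptides as 'high' or 'low' for training.

    `responses` maps detected peptide -> summed XIC.  Responses are
    log transformed, z-scored within the protein, and thresholded at
    z >= 0 (high) and z <= -1 (low); `not_detected` peptides are
    appended to the low class.
    """
    logs = {p: math.log(x) for p, x in responses.items()}
    mu = statistics.mean(logs.values())
    sd = statistics.stdev(logs.values())
    labels = {}
    for p, lx in logs.items():
        z = (lx - mu) / sd
        if z >= 0:
            labels[p] = 'high'
        elif z <= -1:
            labels[p] = 'low'
        # peptides with -1 < z < 0 are left out of the training set
    for p in not_detected:
        labels[p] = 'low'
    return labels
```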
Calculation of physicochemical properties for peptides
A diverse set of 550 physicochemical properties was used to calculate the peptide feature set. Properties such as length and the number of acidic (glutamic acid, aspartic acid) and basic (arginine, lysine, histidine) residues were calculated by counting the amino acids in each peptide. The Bioinformatics Toolbox in Matlab was used to calculate the peptide mass and pI. The gas-phase basicity was calculated from Zhang's model17. The remaining 544 physicochemical properties contained individual values for each amino acid. For each peptide and a given property, the constituent amino acid values were averaged to produce a single value; missing values were ignored. The average (rather than the median or sum) was chosen because it is sensitive to outliers and normalizes for peptide length. It was assumed that the average physicochemical property across each peptide was sufficient to capture relevant information about peptide response. The model does not incorporate protein context such as flanking amino acids or protein-level information (e.g., protein molecular weight or pI); we view this as a separate problem from predicting high-responding peptides34,35. Calculations of the peptide feature set were performed in Matlab R2006b (MathWorks).
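The per-residue averaging can be sketched as follows. This is an illustrative Python stand-in for the Matlab feature calculation; the example scale values shown in the test are Kyte-Doolittle hydropathy values, and the function names are hypothetical.

```python
def peptide_property(peptide, aa_values):
    """Average a per-residue physicochemical property over a peptide.

    `aa_values` maps one-letter amino acid codes to numeric values
    (e.g. a hydrophobicity scale).  Residues missing from the table
    are ignored; taking the mean normalizes for peptide length.
    """
    vals = [aa_values[aa] for aa in peptide if aa in aa_values]
    return sum(vals) / len(vals) if vals else None

def feature_vector(peptide, scales):
    """One averaged value per property, given {name: {aa: value}}."""
    return {name: peptide_property(peptide, table)
            for name, table in scales.items()}
```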
Random Forest classifier for predicting high-responding peptides
Random Forest is a nonlinear ensemble algorithm composed of many individual decision trees, each grown using a randomized tree-building algorithm. For each tree (num_tree), a bootstrap sample (that is, a random data subset sampled with replacement) is selected from the training set. At each decision branch in the tree, the best split is chosen from a randomly selected subset of properties (num_feature), rather than from all properties. These two random steps make each tree different. Predictions result from the ensemble of all trees by majority vote. Instead of relying on this binary classification, a probabilistic output (the fraction of trees that vote high), referred to as the probability of response, was used.
The peptide training data were imbalanced: high-responding peptides, the class of interest, comprised only ~20% of the data. Most classifiers optimize overall accuracy at the expense of misclassifying the minority class (high-responding peptides). Down-sampling is a common technique for handling imbalanced data sets36. In Random Forest, the number of training samples for each class was set to the size of the minority class (n = 623), and samples were selected by bootstrapping with replacement from both the minority and majority classes. This process was repeated for each tree and yields a significant improvement in performance and generalization36.
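The per-tree balanced sampling described above can be sketched as below. This is a minimal illustration of the down-sampling idea (not the randomForest package's internal `sampsize` mechanism): each tree draws a bootstrap of size n_minority from each class.

```python
import random

def balanced_bootstrap(high, low, rng=None):
    """Draw one tree's training sample: bootstrap (with replacement)
    n_minority examples from each class, so every tree sees balanced
    classes even though the full set is ~20% high / ~80% low."""
    rng = rng or random.Random()
    n = min(len(high), len(low))
    sample = ([rng.choice(high) for _ in range(n)] +
              [rng.choice(low) for _ in range(n)])
    rng.shuffle(sample)
    return sample
```

Repeating this draw independently for each of the num_tree trees keeps every tree's class ratio at 50/50 while still letting the ensemble see most of the majority-class examples across trees.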
Balanced class sizes were used to optimize the num_tree and num_feature parameters in Random Forest. The num_feature parameter was optimized by setting num_tree to 1,000 and varying num_feature between 2 and 550 features; the optimal value for num_feature was determined to be 90 (Supplementary Fig. 8 online). The num_tree parameter was optimized by increasing the number of trees until the variable importance measure was consistent and reproducible, and was set to 50,000 trees.
The training data were used to calculate a no-call region in order to judge the model performance on peptides confidently classified as either high or low/not detected. Peptides with a predicted probability of response between 0.38 and 0.65 were labeled as no call and the model was not penalized. Peptides with a predicted probability ≥ 0.65 were classified as high, and peptides with a predicted probability ≤ 0.38 were classified as low/not detected. The reject region was selected based on a false-positive rate (1 − specificity) of 10%. This choice of reject region yielded calls on 74% of the training data.
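The call logic reduces to a simple thresholding rule with a reject band; a minimal sketch (function name illustrative):

```python
def call(prob, low_cut=0.38, high_cut=0.65):
    """Convert a probability of response into a classifier call,
    with a no-call (reject) region between the two cutoffs."""
    if prob >= high_cut:
        return 'high'
    if prob <= low_cut:
        return 'low/not detected'
    return 'no call'
```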
The weighted accuracy was used to account for the imbalanced class sizes and is calculated as Aw = 0.5 × (sensitivity + specificity), where sensitivity is the percentage of true positives and specificity is the percentage of true negatives. The yeast training data were split into training (90%) and test (10%) sets. The training and test set weighted accuracies were 81% and 76%, respectively. We also examined the area under the curve (AUC) of a receiver operating characteristic (ROC) plot24 on the test data. The AUC is a standard measure of performance: a perfect classification has an AUC of 1 and a random classification has an AUC of 0.5. The AUC for the test set was 83% (P = 9.4e-9), indicating that the predictions are significantly better than random (Supplementary Fig. 9 online). Random Forest and ROC calculations were performed in R (http://CRAN.R-project.org/) using the Random Forest package v. 4.5-18 (ref. 19) and the ROCR library v. 1.0-2 (ref. 37).
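The weighted accuracy formula can be computed directly from the confusion-matrix counts; a short sketch (function name illustrative):

```python
def weighted_accuracy(tp, fn, tn, fp):
    """Aw = 0.5 * (sensitivity + specificity), weighting the two
    classes equally regardless of their sizes."""
    sensitivity = tp / (tp + fn)  # fraction of true positives recovered
    specificity = tn / (tn + fp)  # fraction of true negatives recovered
    return 0.5 * (sensitivity + specificity)
```

Because each class contributes through its own rate, Aw is unchanged if the majority class is, say, ten times larger, which is why it suits the ~20/80 class split here.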
Random Forest variable importance score
A measure of how each property contributes to the overall model performance is determined during Random Forest training. When the values of an important property are permuted, there should be a noticeable decrease in model accuracy; when the values of an irrelevant property are permuted, there should be little change. The difference between the two accuracies is then averaged over all trees and normalized by the standard error to produce an importance measure, referred to as the variable importance score.
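The permutation idea behind the variable importance score can be sketched for a single fitted model. This is illustrative only: Random Forest computes the drop per tree on out-of-bag samples and normalizes by the standard error, whereas this sketch just reports the mean accuracy drop over repeated shuffles; all names are hypothetical.

```python
import random

def permutation_importance(predict, X, y, col, n_repeats=30, rng=None):
    """Mean drop in accuracy when one feature column is shuffled.

    `predict` maps a list of feature rows to predicted labels,
    `X` is a list of feature rows (lists), `y` the true labels.
    """
    rng = rng or random.Random()

    def accuracy(rows):
        return sum(p == t for p, t in zip(predict(rows), y)) / len(y)

    base = accuracy(X)
    drops = []
    for _ in range(n_repeats):
        shuffled = [row[:] for row in X]
        perm = [row[col] for row in shuffled]
        rng.shuffle(perm)
        for row, v in zip(shuffled, perm):
            row[col] = v
        drops.append(base - accuracy(shuffled))
    return sum(drops) / n_repeats
```

An informative column yields a clearly positive mean drop, while a column the model never uses yields a drop of zero.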
Permutation test to evaluate the significance of the ESP predictions
All proteins were required to contain at least six predicted tryptic peptides (from an in silico digest) and at least five sequence-identified peptides. For each protein, the five highest-responding peptides were selected based on the experimental data. Then, for the same protein, the five peptides with the highest probability of response were selected using the ESP predictor. For each validation set, the actual test statistic (Ts) was calculated as the sum over proteins of the number of peptides in common between the top five peptides from the experimental and computational methods. Next, a random test statistic (Trs) was calculated by randomly sampling five peptides per protein and taking the sum of the number of peptides in common with the top five experimentally derived peptides. This process was repeated 10,000 times to produce a null distribution for each validation set, which was used to estimate a one-tailed P-value. Using this procedure, the statistical significance of the predictions made by the ESP predictor was calculated as the number of proteins (also selected at random from the respective validation set) increased. The permutation test implicitly accounts for differences in the number of peptides from each protein. The permutation test calculations were performed in R.
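The test statistic and its permutation null can be sketched as below. This is an illustrative Python version of the procedure (the paper's calculations were done in R); the function names are hypothetical.

```python
import random

def overlap_statistic(top_exp, top_pred):
    """T_s: summed overlap between experimental and predicted top-5
    peptide sets across proteins."""
    return sum(len(set(top_exp[p]) & set(top_pred[p])) for p in top_exp)

def permutation_pvalue(top_exp, top_pred, all_peptides, n_perm=10000, rng=None):
    """One-tailed P-value: the fraction of random top-5 selections
    whose overlap with the experimental top 5 is at least T_s.

    `all_peptides` maps protein -> full candidate peptide list, so the
    random draw implicitly accounts for differing peptide counts.
    """
    rng = rng or random.Random()
    t_actual = overlap_statistic(top_exp, top_pred)
    exceed = 0
    for _ in range(n_perm):
        t_rand = sum(
            len(set(top_exp[p]) & set(rng.sample(all_peptides[p], 5)))
            for p in top_exp)
        if t_rand >= t_actual:
            exceed += 1
    return exceed / n_perm
```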
Analysis and MS summary for all validation sets
All protein mixtures were digested using trypsin and analyzed by reversed-phase nano LC-ESI-MS/MS on multiple LTQ Orbitrap and LTQ-FT mass spectrometers (Thermo). Specific conditions concerning chromatography, buffers, injection volume and MS analysis settings varied according to each validation set (full details for all validation sets are provided in the Supplementary Methods). Validation sets were subsequently processed using either Spectrum Mill 3.4 beta (Agilent Technologies) or Mascot (Matrix Science) to determine sequence-identified peptides from the collected MS/MS spectra. Peptide response was calculated using either an in-house program to calculate the XIC, MSQuant v. 1.4.2 b5 (http://msquant.sourceforge.net/), or Spectrum Mill. The total peptide response was calculated by summing all forms of a given peptide (that is, multiple charge states and the following modifications: carbamidomethylation, carbamylated lysine, oxidized methionine and pyroglutamic acid). The following is a brief summary of each validation set:
The ISB standard protein mix is a publicly available standard protein mix consisting of 18 proteins provided by the Institute for Systems Biology (ISB)38. Only the LTQ-FT data were considered.
Yeast test refers to the 10% of proteins held out from the training set to evaluate the model performance.
Plasma refers to neat plasma (that is, undepleted plasma).
UPS refers to a set of 48 equimolar proteins (Universal Proteomics Standard Set, Sigma). The samples were digested using a trifluoroethanol-assisted digestion protocol39.
Plasma Hu14 SCX refers to a plasma sample with the 14 most abundant proteins removed using a MARS Hu-14 column (Agilent Technologies) and then fractionated using strong cation exchange (SCX). Eleven fractions were collected and analyzed.
Yeast_2 refers to a separate independent analysis of a yeast mixture. Importantly, proteins in common with the yeast training set were removed.
HeLa_1 refers to HeLaS3 cell lysate digested in-solution.
Pull-Down refers to a GeLC-MS affinity pull-down experiment from a HeLaS3 cell lysate.
Plasma Hu14 refers to a plasma sample with the 14 most abundant proteins removed using a MARS Hu-14 column.
MRM-MS assay development
The validated MRM peptides were defined from single protein digests for each of the 14 proteins. Peptide selection for the 14 target proteins was based on experimental observation using commercially available protein standards. Briefly, the proteins were individually digested with trypsin and analyzed by nano LC-MS/MS in positive-ion electrospray on an LTQ linear ion trap mass spectrometer (Thermo) with data-dependent acquisition. Peptide-sequence identity was determined by running Spectrum Mill on the collected MS/MS spectra. Approximately five candidate peptide standards per protein were chosen based primarily on high relative response. Exclusion criteria included large hydrophobic or small hydrophilic peptides, flanking tryptic ends with dibasic amino acids (KK, RR, KR, RK) at the N or C terminus, and peptide identity corresponding to multiple endogenous plasma proteins. Peptide standards containing methionine and cysteine were avoided if possible. Stable isotope-labeled versions of each candidate peptide were synthesized for quantification, and MRM response curves were optimized in plasma for each protein over a wide concentration range. All peptides that performed satisfactorily over the response curves are referred to as "validated MRM peptides."
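The rule-based part of this candidate screen can be sketched as a simple filter. The dibasic-end and Met/Cys criteria follow the text above; the length bounds are illustrative assumptions (the paper gives only qualitative size criteria), and the function name is hypothetical.

```python
def candidate_ok(peptide, flank_n='', flank_c='',
                 min_len=6, max_len=25, avoid='MC'):
    """Screen a tryptic peptide against rule-based exclusion criteria.

    Rejects very short/long peptides (illustrative bounds), dibasic
    pairs (KK, RR, KR, RK) spanning either tryptic end, and peptides
    containing Met or Cys.  `flank_n`/`flank_c` are the residues
    adjacent to the peptide in the parent protein sequence.
    """
    if not (min_len <= len(peptide) <= max_len):
        return False
    dibasic = {'KK', 'RR', 'KR', 'RK'}
    if flank_n and (flank_n + peptide[0]) in dibasic:
        return False
    if flank_c and (peptide[-1] + flank_c) in dibasic:
        return False
    if any(aa in peptide for aa in avoid):
        return False
    return True
```

A filter like this narrows the candidate list before the experimental steps (response curves with stable isotope-labeled standards) that actually validate each peptide.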
All MRM experiments were performed on a 4000 Q Trap Hybrid triple quadrupole/linear ion trap mass spectrometer coupled to a Tempo LC system (Applied Biosystems). Data analysis was done using MultiQuant software (Applied Biosystems).
The GPM database was searched (December 19, 2008) by entering the protein Ensembl accession number and then selecting the ‘MRM’ link. For some proteins, a large number of peptides were listed. Only the top five peptides were considered based on the number of times observed in the GPM database.
Data and software availability