The measurement of all mRNA and protein expression levels in an organism is a fundamental biological goal. Although mRNA expression levels are now routinely measured on a large scale, high-throughput protein identification methods such as western blotting, 2D gel electrophoresis and green fluorescent protein (GFP) fusion tagging are very expensive in labor, time and resources. Mass spectrometry (MS)-based shotgun proteomics is a simple alternative to these methods. With sensitive tandem mass spectrometry (MS/MS) instruments or extensive biochemical fractionation, several thousand proteins can be identified (Brunner et al.; Graumann et al.; Peng et al.; Washburn et al.). However, less costly approaches identify only a few hundred proteins in a complex protein sample.
A shotgun proteomics experiment typically proceeds by MS/MS analysis of peptides from proteolytically digested proteins, followed by in silico matching of the MS/MS spectra against a database of theoretical peptide spectra derived from protein sequences (Fig. 1). Proteins are identified from combined evidence for their constituent peptides, resulting in a list in which each protein is associated with a confidence score of correct identification. We refer to this score as the ‘original’, ‘primary’ or ‘raw’ protein identification score; here it is computed with ProteinProphet (Nesvizhskii et al.). All proteins with scores greater than a chosen threshold are labeled ‘present’ (Fig. 1).
Fig. 1. Boosting protein identifications with prior information on mRNA concentration. A complex protein sample, e.g. cellular extract, is enzymatically digested into peptides and subjected to MS/MS. Raw MS/MS spectra are searched against a database of sequences ...
Protein identification in an MS/MS experiment is hindered by a number of factors: noisy spectra, low-concentration proteins, post-translational modifications and chemical properties that interfere with efficient peptide ionization. For complex samples such as cell lysates, current MS search algorithms typically match a disproportionately small percentage (<20%) of all MS/MS spectra to peptides in a database, and only a small fraction of the expected proteins is identified. In other words, despite their presence in the biological sample, raw MS/MS identification scores of many proteins fall below a given confidence threshold and the proteins are incorrectly labeled as ‘not present’.
The vast majority of MS/MS experiments are analyzed without considering any prior information regarding a protein's presence in the sample. MS/MS protein identification scoring schemes, e.g. BioWorks (ThermoFinnigan) or ProteinProphet (Nesvizhskii et al.), assume that all proteins are equally likely to be present. In reality, other information may be readily available and can be used to influence the inferred probability of protein presence when evidence from the MS/MS experiment is weak.
Our method, MSpresso (for MS and exPRESSion data), integrates data from MS/MS experiments with mRNA expression data in a Bayesian framework. MSpresso computes a new protein identification score as the posterior probability of the protein being present in the sample given both its MS/MS and mRNA scores.
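As a rough illustration of the Bayesian idea, and not the MSpresso model itself, the sketch below re-weights a raw MS/MS identification probability with an mRNA-derived prior. It makes two assumptions that do not come from the text: that the raw score can be read as a posterior probability computed under a flat prior of 0.5, and that a hypothetical function `mrna_prior` (a logistic curve in log-abundance, with made-up `midpoint` and `slope` parameters) maps mRNA abundance to a prior probability of presence.

```python
import math


def mrna_prior(mrna_abundance, midpoint=1.0, slope=1.0, floor=0.05):
    """Hypothetical prior P(present | mRNA).

    Assumption for illustration only: a logistic curve in log-abundance.
    Proteins with no detected mRNA get a small, arbitrary floor value.
    """
    if mrna_abundance <= 0:
        return floor
    z = slope * (math.log(mrna_abundance) - math.log(midpoint))
    return 1.0 / (1.0 + math.exp(-z))


def posterior_present(raw_ms_score, prior, flat_prior=0.5):
    """Bayes update of a raw MS/MS score with a new prior.

    Assumes raw_ms_score is a posterior obtained under `flat_prior`,
    so dividing out the flat prior odds recovers a likelihood ratio.
    """
    eps = 1e-9  # keep probabilities away from 0 and 1 to avoid division by zero
    p = min(max(raw_ms_score, eps), 1.0 - eps)
    q = min(max(prior, eps), 1.0 - eps)
    likelihood_ratio = (p / (1.0 - p)) / (flat_prior / (1.0 - flat_prior))
    posterior_odds = likelihood_ratio * q / (1.0 - q)
    return posterior_odds / (1.0 + posterior_odds)


# Toy usage: a weak MS/MS score for a protein whose mRNA is abundant.
boosted = posterior_present(0.6, mrna_prior(9.0))
```

Under these assumptions, a protein with weak MS/MS evidence but abundant mRNA moves above its raw score, one with scarce mRNA moves below it, and a prior of 0.5 leaves the raw score unchanged.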
We demonstrate the applicability of MSpresso on a yeast sample grown in rich medium and analyzed on a low-resolution mass spectrometer (LCQ). We use mRNA concentrations from three independent experiments (Holstege et al.; Velculescu et al.; Wang et al.) and corresponding protein data from four MS experiments (Chi et al.; de Godoy et al.; Peng et al.; Washburn et al.). We compare the performance of MSpresso on the yeast sample with the original raw MS/MS identification scores using receiver operating characteristic (ROC) plots, and find an increase of ~40% in the number of proteins identified at a fixed error rate. We validate 98% of these new identifications by their presence in at least one of the seven independent benchmarking datasets. We also generalize the method and demonstrate its applicability to data from a high-resolution MS/MS instrument, to different biological conditions and to other organisms (Escherichia coli, human). To the best of our knowledge, MSpresso is the first integrative approach to the analysis of shotgun proteomics data.
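The comparison of raw and re-scored identifications at a fixed error rate, described above, can be sketched as follows. This is a minimal illustration rather than the evaluation procedure used in the study: it assumes that the error rate is the false positive rate measured against a benchmark labeling, and the function names and toy values are hypothetical.

```python
def threshold_at_fpr(scores, is_true, max_fpr=0.05):
    """Lowest score threshold whose false positive rate stays <= max_fpr.

    `is_true` marks proteins supported by a benchmark set (an assumption
    of this sketch). Tied scores are accepted or rejected together.
    Returns infinity if no threshold satisfies the limit.
    """
    n_negative = sum(1 for t in is_true if not t)
    ranked = sorted(zip(scores, is_true), key=lambda pair: -pair[0])
    best = float("inf")
    false_positives = 0
    i = 0
    while i < len(ranked):
        j = i
        # Walk over all proteins that share the current score.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if not ranked[j][1]:
                false_positives += 1
            j += 1
        if n_negative and false_positives / n_negative > max_fpr:
            break
        best = ranked[i][0]
        i = j
    return best


def count_identified(scores, threshold):
    """Number of proteins labeled 'present' (score at or above threshold)."""
    return sum(1 for s in scores if s >= threshold)


# Toy values, invented for illustration only.
benchmark = [True, True, True, False, True, False]
raw_scores = [0.95, 0.90, 0.60, 0.55, 0.40, 0.30]
new_scores = [0.97, 0.93, 0.80, 0.35, 0.70, 0.20]

n_raw = count_identified(raw_scores, threshold_at_fpr(raw_scores, benchmark, 0.0))
n_new = count_identified(new_scores, threshold_at_fpr(new_scores, benchmark, 0.0))
```

Comparing `n_new` with `n_raw` at the same `max_fpr` gives the kind of relative gain reported in the text, although the numbers here say nothing about the actual datasets.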