Motivation: High-throughput protein identification experiments based on tandem mass spectrometry (MS/MS) often suffer from low sensitivity and low-confidence protein identifications. In a typical shotgun proteomics experiment, it is assumed that all proteins are equally likely to be present. However, there is often other evidence suggesting that a protein is present, and confidence in individual protein identifications can be updated accordingly.
Results: We develop a method that analyzes MS/MS experiments in the larger context of the biological processes active in a cell. Our method, MSNet, improves protein identification in shotgun proteomics experiments by considering information on functional associations from a gene functional network. MSNet substantially increases the number of proteins identified in the sample at a given error rate. We identify 8–29% more proteins than the original MS experiment when applied to yeast grown in different experimental conditions analyzed on different MS/MS instruments, and 37% more proteins in a human sample. We validate up to 94% of our identifications in yeast by presence in ground-truth reference sets.
Availability and Implementation: Software and datasets are available at http://aug.csres.utexas.edu/msnet
Contact: firstname.lastname@example.org, email@example.com
Supplementary information: Supplementary data are available at Bioinformatics online.
The problem of identifying proteins from a shotgun proteomics experiment has not been definitively solved. Identifying the proteins in a sample requires ranking them, ideally with interpretable scores. In particular, “degenerate” peptides, which map to multiple proteins, have made such a ranking difficult to compute. The problem of computing posterior probabilities for the proteins, which can be interpreted as confidence in a protein’s presence, has been especially daunting. Previous approaches have either ignored the peptide degeneracy problem completely, addressed it by computing a heuristic set of proteins or heuristic posterior probabilities, or by estimating the posterior probabilities with sampling methods. We present a probabilistic model for protein identification in tandem mass spectrometry that recognizes peptide degeneracy. We then introduce graph-transforming algorithms that facilitate efficient computation of protein probabilities, even for large data sets. We evaluate our identification procedure on five different well-characterized data sets and demonstrate our ability to efficiently compute high-quality protein posteriors.
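To make the degeneracy issue concrete, here is a minimal sketch (an illustration of the common independence-style calculation, not the authors' exact model) in which a shared peptide's evidence is split equally across its mapped proteins and each protein's posterior is a noisy-OR over its peptide probabilities:

from collections import defaultdict

def protein_posteriors(peptide_probs, peptide_to_proteins):
    """peptide_probs: {peptide: P(correct)}; peptide_to_proteins: {peptide: [protein ids]}."""
    shared = defaultdict(list)
    for pep, prob in peptide_probs.items():
        proteins = peptide_to_proteins[pep]
        for prot in proteins:
            # heuristic: split a degenerate peptide's evidence equally
            shared[prot].append(prob / len(proteins))
    posteriors = {}
    for prot, probs in shared.items():
        p_absent = 1.0
        for p in probs:
            p_absent *= 1.0 - p  # absent only if every supporting peptide is wrong
        posteriors[prot] = 1.0 - p_absent
    return posteriors

peps = {"LSEEDLK": 0.98, "AVGDR": 0.60}
mapping = {"LSEEDLK": ["P1"], "AVGDR": ["P1", "P2"]}
print(protein_posteriors(peps, mapping))  # P1 ~0.986, P2 0.30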
Shotgun proteomics using mass spectrometry is a powerful method for protein identification but suffers limited sensitivity in complex samples. Integrating peptide identifications from multiple database search engines is a promising strategy to increase the number of peptide identifications and reduce the volume of unassigned tandem mass spectra. Existing methods pool statistical significance scores such as p-values or posterior probabilities of peptide-spectrum matches (PSMs) from multiple search engines after high scoring peptides have been assigned to spectra, but these methods lack reliable control of identification error rates as data are integrated from different search engines. We developed a statistically coherent method for integrative analysis, termed MSblender. MSblender converts raw search scores from search engines into a probability score for all possible PSMs and properly accounts for the correlation between search scores. The method reliably estimates false discovery rates and identifies more PSMs than any single search engine at the same false discovery rate. The increased identifications raise spectral counts for all detected proteins and allow quantification of proteins that would not have been quantified by individual search engines. We also demonstrate that enhanced quantification contributes to improved sensitivity in differential expression analyses.
integrative analysis; database search; peptide identification
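For context, the error-rate control that such comparisons rest on is usually the target-decoy estimate FDR ≈ (#decoy PSMs)/(#target PSMs) above a score threshold. A generic sketch of that standard estimate (not MSblender's internal model; scores below are invented):

def accept_at_fdr(scored_psms, max_fdr=0.01):
    """scored_psms: list of (score, is_decoy); returns target scores accepted
    at the deepest threshold whose estimated FDR stays below max_fdr."""
    ranked = sorted(scored_psms, key=lambda x: -x[0])
    cutoff, targets, decoys = None, 0, 0
    for score, is_decoy in ranked:
        decoys += is_decoy
        targets += not is_decoy
        if targets and decoys / targets <= max_fdr:
            cutoff = score  # running FDR estimate still acceptable here
    if cutoff is None:
        return []
    return [s for s, d in ranked if not d and s >= cutoff]

psms = [(9.1, False), (8.7, False), (8.2, True), (7.9, False), (7.5, True)]
print(accept_at_fdr(psms, max_fdr=0.34))  # [9.1, 8.7, 7.9]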
Proteogenomics has the potential to advance genome annotation through high-quality peptide identifications derived from mass spectrometry experiments, which demonstrate that a given gene or isoform is expressed and translated at the protein level. This can advance our understanding of genome function, discovering novel genes and gene structures that have not yet been identified or validated. Because of the high-throughput shotgun nature of most proteomics experiments, it is essential to carefully control for false positives and prevent any potential misannotation. A number of statistical procedures to deal with this are in wide use in proteomics, calculating false discovery rate (FDR) and posterior error probability (PEP) values for groups and individual peptide spectrum matches (PSMs). These methods control for multiple testing and exploit decoy databases to estimate statistical significance. Here, we show that database choice has a major effect on these confidence estimates, leading to significant differences in the number of PSMs reported. We note that standard target:decoy approaches using six-frame translations of nucleotide sequences, such as assembled transcriptome data, apparently underestimate the confidence assigned to the PSMs. The source of this error stems from the inflated and unusual nature of the six-frame database, where for every target sequence there exist five "incorrect" targets that are unlikely to code for protein. The attendant FDR and PEP estimates lead to fewer accepted PSMs at fixed thresholds, and we show that this effect is a product of the database and statistical modeling and not the search engine. A variety of approaches to limit database size and remove noncoding target sequences are examined and discussed in terms of the altered statistical estimates generated and PSMs reported. These results are of importance to groups carrying out proteogenomics, aiming to maximize the validation and discovery of gene structure in sequenced genomes, while still controlling for false discoveries.
proteogenomics; peptide spectrum match; false discovery rate; posterior error probability; expressed
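A hedged toy simulation of the effect described above (illustrative only; the score distributions and counts are invented): the five non-coding frames behave like extra "incorrect" targets, so decoy matches accumulate faster relative to true matches and fewer PSMs survive a fixed FDR threshold.

import random
random.seed(1)

def accepted_at_fdr(true_n, junk_n, max_fdr=0.01):
    # true_n correct identifications, junk_n random matches to non-coding
    # target entries, plus an equally sized decoy set (toy score models)
    scored = [(random.gauss(8, 1), 'target') for _ in range(true_n)]
    scored += [(random.gauss(4, 1), 'target') for _ in range(junk_n)]
    scored += [(random.gauss(4, 1), 'decoy') for _ in range(true_n + junk_n)]
    scored.sort(key=lambda s: -s[0])
    best = t = d = 0
    for _, kind in scored:
        t += kind == 'target'
        d += kind == 'decoy'
        if t and d / t <= max_fdr:
            best = t  # deepest point still under the FDR cap
    return best

print(accepted_at_fdr(1000, 2000))   # ordinary protein database
print(accepted_at_fdr(1000, 12000))  # ~6x inflated six-frame database: fewer PSMs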
This manuscript provides a comprehensive review of the peptide and protein identification process using tandem mass spectrometry (MS/MS) data generated in shotgun proteomic experiments. The commonly used methods for assigning peptide sequences to MS/MS spectra are critically discussed and compared, from basic strategies to advanced multi-stage approaches. Particular attention is paid to the problem of false-positive identifications. Existing statistical approaches for assessing the significance of peptide-to-spectrum matches are surveyed, ranging from single-spectrum approaches such as expectation values to global error rate estimation procedures such as false discovery rates and posterior probabilities. The importance of using auxiliary discriminant information (mass accuracy, peptide separation coordinates, digestion properties, etc.) is discussed, and advanced computational approaches for joint modeling of multiple sources of information are presented. This review also includes a detailed analysis of the issues affecting the interpretation of data at the protein level, including the amplification of error rates when going from the peptide to the protein level, and the ambiguities in inferring the identities of sample proteins in the presence of shared peptides. Commonly used methods for computing protein-level confidence scores are discussed in detail. The review concludes with a discussion of several outstanding computational issues.
Proteomics; Bioinformatics; Mass Spectrometry; Peptide Identification; Protein Inference; Statistical Models; False Discovery Rates
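One relationship covered by such reviews is worth stating concretely: the FDR of an accepted set of PSMs equals the average of their posterior error probabilities. A minimal sketch of this standard relation (values are invented):

def fdr_from_peps(peps):
    """peps: posterior error probabilities of the accepted PSMs."""
    # each PEP is the probability that one accepted PSM is wrong, so the
    # expected fraction of wrong PSMs in the set is their mean
    return sum(peps) / len(peps)

accepted = [0.001, 0.003, 0.01, 0.04]
print(f"estimated FDR = {fdr_from_peps(accepted):.4f}")  # 0.0135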
Shotgun proteomics has been used extensively for characterization of a number of proteomes. High resolution Fourier transform mass spectrometry (FTMS) has emerged as a powerful tool owing to its high mass accuracy and resolving power. One of its major limitations, however, is that the confidence level of peptide identification and sensitivity cannot be maximized simultaneously. Although it is generally assumed that higher resolution is better for peptide identifications, the precise effect of varying resolution as a parameter on peptide identification has not yet been systematically evaluated. We used the Escherichia coli proteome and a standard 48-protein mix to study the effect of different resolution parameters on peptide identifications in the setting of a shotgun proteomics experiment on an LTQ-Orbitrap mass spectrometer. We observed a higher number of peptide-spectrum matches (PSMs) whenever the MS scan was carried out by FT and the MS/MS in the ion trap (IT), with the maximum PSMs obtained at an MS resolution of 30,000. In contrast, when samples were analyzed by FT for both MS and MS/MS, the number of PSMs was significantly lower (~40% as compared to FT-IT experiments), with the maximum PSMs obtained when both the MS and MS/MS resolution were set to 15,000. Thus, a 15K-15K resolution setting may provide the best compromise for studies where both speed and accuracy are important, such as high-throughput post-translational modification analysis and de novo sequencing. We hope that our study will allow researchers to choose between different resolution parameters to achieve their desired results from proteomic analyses.
FTMS; duty cycle; E. coli proteome; PSM
Shotgun proteomics has recently emerged as a powerful approach to characterizing proteomes in biological samples. Its overall objective is to identify the form and quantity of each protein in a high-throughput manner by coupling liquid chromatography with tandem mass spectrometry. As a consequence of its high throughput nature, shotgun proteomics faces challenges with respect to the analysis and interpretation of experimental data. Among such challenges, the identification of proteins present in a sample has been recognized as an important computational task. This task generally consists of (1) assigning experimental tandem mass spectra to peptides derived from a protein database, and (2) mapping assigned peptides to proteins and quantifying the confidence of identified proteins. Protein identification is fundamentally a statistical inference problem with a number of methods proposed to address its challenges. In this review we categorize current approaches into rule-based, combinatorial optimization and probabilistic inference techniques, and present them using integer programming and Bayesian inference frameworks. We also discuss the main challenges of protein identification and propose potential solutions with the goal of spurring innovative research in this area.
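As one concrete instance of the rule-based/combinatorial category, the parsimony criterion (a minimum set of proteins explaining all observed peptides) is a set-cover problem; a hedged sketch of the usual greedy approximation, with an invented toy database:

def greedy_parsimony(protein_to_peptides):
    """protein_to_peptides: {protein: set of identified peptides}.
    Greedy approximation to the minimum set cover (the exact problem is
    what integer programming formulations solve optimally)."""
    uncovered = set().union(*protein_to_peptides.values())
    chosen = []
    while uncovered:
        # pick the protein explaining the most not-yet-covered peptides
        best = max(protein_to_peptides,
                   key=lambda p: len(protein_to_peptides[p] & uncovered))
        gained = protein_to_peptides[best] & uncovered
        if not gained:
            break
        chosen.append(best)
        uncovered -= gained
    return chosen

db = {"P1": {"a", "b", "c"}, "P2": {"b"}, "P3": {"c", "d"}}
print(greedy_parsimony(db))  # ['P1', 'P3'] explains all peptides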
In shotgun proteomics, a complex protein mixture is digested to peptides, separated and identified by microcapillary liquid chromatography followed by tandem mass spectrometry (LC-MS-MS). In this technology, complete protein digestion is often assumed. We show that, to the contrary, modifications to a standard digestion protocol yield large, reproducible improvements in protein identification, a result consistent with digestion being a limiting factor in the efficiency of protein identification.
mass spectrometry; proteomics; digestion; protein identification
Peptides are routinely identified from mass spectrometry-based proteomics experiments by matching observed spectra to peptides derived from protein databases. The error rates of these identifications can be estimated by target-decoy analysis, which involves matching spectra to shuffled or reversed peptides. Besides estimating error rates, decoy searches can be used by semi-supervised machine learning algorithms to increase the number of confidently identified peptides. As for all machine learning algorithms, however, the results must be validated to avoid issues such as overfitting or biased learning, which would produce unreliable peptide identifications. Here, we discuss how the target-decoy method is employed in machine learning for shotgun proteomics, focusing on how the results can be validated by cross-validation, a frequently used validation scheme in machine learning. We also use simulated data to demonstrate the proposed cross-validation scheme's ability to detect overfitting.
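A minimal sketch of the k-fold scheme discussed (Percolator-style workflows typically use k = 3; train_fn and score_fn are placeholders for a real semi-supervised learner): every PSM is scored by a model that never saw it during training, which is the safeguard against overfitting the decoy labels.

def cross_validated_scores(psms, train_fn, score_fn, k=3):
    """psms: list of feature dicts carrying an 'is_decoy' label.
    train_fn(train_set) -> model; score_fn(model, psm) -> float."""
    folds = [psms[i::k] for i in range(k)]
    scores = {}  # keyed by id() since dicts are unhashable
    for i, held_out in enumerate(folds):
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        model = train_fn(train)  # semi-supervised: decoys act as negatives
        for psm in held_out:
            scores[id(psm)] = score_fn(model, psm)
    return scores

demo = [{"xcorr": x / 10, "is_decoy": x % 3 == 0} for x in range(9)]
naive_train = lambda train: None               # placeholder "model"
naive_score = lambda model, psm: psm["xcorr"]  # placeholder scorer
print(len(cross_validated_scores(demo, naive_train, naive_score)))  # 9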
In shotgun proteomics, protein identification by tandem mass spectrometry relies on bioinformatics tools. Despite recent improvements in identification algorithms, a significant number of high quality spectra remain unidentified for various reasons. Here we present ScanRanker, an open-source tool that evaluates the quality of tandem mass spectra via sequence tagging with reliable performance in data from different instruments. The superior performance of ScanRanker enables it not only to find unassigned high quality spectra that evade identification through database search, but also to select spectra for de novo sequencing and cross-linking analysis. In addition, we demonstrate that the distribution of ScanRanker scores predicts the richness of identifiable spectra among multiple LC-MS/MS runs in an experiment, and ScanRanker scores assist the process of peptide assignment validation to increase confident spectrum identifications. The source code and executable versions of ScanRanker are available from http://fenchurch.mc.vanderbilt.edu.
spectral quality; sequence tagging; bioinformatics; tandem mass spectrometry; cross-linking
Tandem mass spectrometry-based shotgun proteomics has become a widespread technology for analyzing complex protein mixtures. A number of database searching algorithms have been developed to assign peptide sequences to tandem mass spectra. Assembling the peptide identifications to proteins, however, is a challenging issue because many peptides are shared among multiple proteins. IDPicker is an open-source protein assembly tool that derives a minimum protein list from peptide identifications filtered to a specified False Discovery Rate. Here, we update IDPicker to increase confident peptide identifications by combining multiple scores produced by database search tools. By segregating peptide identifications for thresholding using both the precursor charge state and the number of tryptic termini, IDPicker retrieves more peptides for protein assembly. The new version is more robust against false positive proteins, especially in searches using multispecies databases, by requiring additional novel peptides in the parsimony process. IDPicker has been designed for incorporation in many identification workflows by the addition of a graphical user interface and the ability to read identifications from the pepXML format. These advances position IDPicker for high peptide discrimination and reliable protein assembly in large-scale proteomics studies. The source code and binaries for the latest version of IDPicker are available from http://fenchurch.mc.vanderbilt.edu/.
bioinformatics; parsimony; protein assembly; protein inference; false discovery rate
The fission yeast Schizosaccharomyces pombe is a widely used model organism to study basic mechanisms of eukaryotic biology, but unlike other model organisms, its proteome remains largely uncharacterized. Using a shotgun proteomics approach based on multidimensional prefractionation and tandem mass spectrometry, we have detected ∼30% of the theoretical fission yeast proteome. Applying statistical modelling to normalize spectral counts to the number of predicted tryptic peptides, we have performed label-free quantification of 1465 proteins. The fission yeast protein data showed considerable correlations with mRNA levels and with the abundance of orthologous proteins in budding yeast. Functional pathway analysis indicated that the mRNA–protein correlation is strong for proteins involved in signalling and metabolic processes, but increasingly discordant for components of protein complexes, which clustered in groups with similar mRNA–protein ratios. Self-organizing map clustering of large-scale protein and mRNA data from fission and budding yeast revealed coordinate but not always concordant expression of components of functional pathways and protein complexes. This finding reaffirms at the protein level the considerable divergence in gene expression patterns of the two model organisms that was noticed in previous transcriptomic studies.
fission yeast; LC-MS/MS; mRNA–protein correlation; relative protein quantification; protein profiling
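A hedged sketch of the normalization idea described above (not the authors' exact statistical model): divide each protein's spectral count by its number of predicted tryptic peptides in a usable length range.

import re

def tryptic_peptides(sequence, min_len=6, max_len=30):
    # cleave C-terminal to K/R, except before P (classic trypsin rule);
    # zero-width split requires Python 3.7+
    peptides = re.split(r'(?<=[KR])(?!P)', sequence)
    return [p for p in peptides if min_len <= len(p) <= max_len]

def normalized_count(spectral_count, sequence):
    n = len(tryptic_peptides(sequence))
    return spectral_count / n if n else 0.0

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVK"
print(len(tryptic_peptides(seq)), normalized_count(12, seq))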
Protein inference from peptide identifications in shotgun proteomics must deal with ambiguities that arise due to the presence of peptides shared between different proteins, which is common in higher eukaryotes. Recently, data independent acquisition (DIA) approaches have emerged as an alternative to the traditional data dependent acquisition (DDA) in shotgun proteomics experiments. MSE is one such DIA approach, used in QTOF instruments. MSE data require specialized software to process acquired spectra and to perform peptide and protein identifications. However, currently available software does not group the identified proteins in a transparent way by taking peptide evidence categories into account. Furthermore, inspecting, comparing, and reporting the obtained results requires tedious manual intervention. Here we report a software tool to address these limitations for MSE data.
In this paper we present PAnalyzer, a software tool focused on the protein inference process of shotgun proteomics. Our approach considers all the identified proteins and groups them when necessary, indicating their confidence using different evidence categories. PAnalyzer can read protein identification files in the XML output format of the ProteinLynx Global Server (PLGS) software provided by Waters Corporation for their MSE data, and also in the mzIdentML format recently standardized by HUPO-PSI. Multiple files can also be read simultaneously and are treated as technical replicates. Results are saved to CSV, HTML and mzIdentML (in the case of a single mzIdentML input file) files. An MSE analysis of a real sample is presented to compare the results of PAnalyzer and ProteinLynx Global Server.
We present a software tool to deal with the ambiguities that arise in the protein inference process. Key contributions are support for MSE data analyzed with ProteinLynx Global Server and integration of technical replicates. PAnalyzer is an easy-to-use, multiplatform, free software tool.
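A hedged sketch of evidence-category grouping in this spirit (the category names here are illustrative approximations, not PAnalyzer's exact definitions):

def categorize(protein_to_peptides):
    """protein_to_peptides: {protein: set of identified peptides}."""
    cats = {}
    for prot, peps in protein_to_peptides.items():
        others = set()
        for k, v in protein_to_peptides.items():
            if k != prot:
                others |= v
        if peps - others:
            cats[prot] = "conclusive"         # has evidence only it explains
        elif any(v == peps for k, v in protein_to_peptides.items() if k != prot):
            cats[prot] = "indistinguishable"  # identical peptide evidence
        else:
            cats[prot] = "ambiguous"          # all evidence shared
    return cats

db = {"P1": {"a", "b"}, "P2": {"b"}, "P3": {"b"}}
print(categorize(db))  # P1 conclusive; P2/P3 indistinguishable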
Peptide labeling with isobaric tags has become a popular technique in quantitative shotgun proteomics. Using two different samples viz. a protein mixture and HeLa extracts, we show that three commercially available isobaric tags differ with regard to peptide identification rates: The number of identified proteins and peptides was largest with iTRAQ 4-plex, followed by TMT 6-plex, and smallest with iTRAQ 8-plex. In all experiments, we employed a previously described method where two scans were acquired for each precursor on an LTQ Orbitrap: A CID scan under standard settings for identification, and a HCD scan for quantification. The observed differences in identification rates were similar when data was searched with either Mascot or Sequest. We consider these findings to be the result of a combination of several factors, most notably prominent ions in CID spectra as a consequence of loss of fragments of the label tag from precursor ions. These fragment ions cannot be explained by current search engines and were observed to have a negative impact on peptide scores.
Mass spectrometry-based protein identification methods are fundamental to proteomics. Biological experiments are usually performed in replicates, and proteomic analyses generate huge datasets that need to be integrated and quantitatively analyzed. The Sequest™ search algorithm is commonly used for identifying peptides and proteins from two-dimensional liquid chromatography electrospray ionization tandem mass spectrometry (2-D LC ESI MS2) data. A number of proteomic pipelines that facilitate high-throughput 'post data acquisition analysis' are described in the literature. However, these pipelines need to be updated to accommodate the rapidly evolving data analysis methods. Here, we describe a proteomic data analysis pipeline that specifically addresses two main issues pertinent to protein identification and differential expression analysis: 1) estimation of the probability of peptide and protein identifications and 2) non-parametric statistics for protein differential expression analysis. Our proteomic analysis workflow analyzes replicate datasets from a single experimental paradigm to generate a list of identified proteins with their probabilities and significant changes in protein expression using parametric and non-parametric statistics.
The input for our workflow is Bioworks™ 3.2 Sequest (or a later version, including cluster) output in XML format. We use a decoy database approach to assign probability to peptide identifications. The user has the option to select "quality thresholds" on peptide identifications based on the P value. We also estimate probability for protein identification. Proteins identified with peptides at a user-specified threshold value from biological experiments are grouped as either control or treatment for further analysis in ProtQuant. ProtQuant utilizes a parametric (ANOVA) method for calculating differences in protein expression based on the quantitative measure ΣXcorr. Alternatively, ProtQuant output can be further processed using non-parametric Monte-Carlo resampling statistics to calculate P values for differential expression. Correction for multiple testing of ANOVA and resampling P values is done using Benjamini and Hochberg's method. The results of these statistical analyses are then combined into a single output file containing a comprehensive protein list with probabilities and differential expression analysis, associated P values, and resampling statistics.
For biologists carrying out proteomics by mass spectrometry, our workflow facilitates automated, easy to use analyses of Bioworks (3.2 or later versions) data. All the methods used in the workflow are peer-reviewed and as such the results of our workflow are compliant with proteomic data submission guidelines to public proteomic data repositories including PRIDE. Our workflow is a necessary intermediate step that is required to link proteomics data to biological knowledge for generating testable hypotheses.
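For reference, the Benjamini-Hochberg correction mentioned above, as a standalone sketch of the standard procedure (not the pipeline's own code): adjusted p(i) = min over j >= i of p(j) * m / j for sorted p-values.

def benjamini_hochberg(pvalues):
    """Returns BH-adjusted p-values in the original input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.6]))
# [0.005, 0.02, 0.05125, 0.05125, 0.6]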
Shotgun proteomics protocols are widely used for the identification and/or quantitation of proteins in complex biological samples. Described here is a shotgun proteomics protocol that can be used to identify the protein targets of biologically relevant ligands in complex protein mixtures. The protocol combines a quantitative proteomics platform with a covalent modification strategy, termed Stability of Proteins from Rates of Oxidation (SPROX), which utilizes the denaturant dependence of hydrogen peroxide-mediated oxidation of methionine side chains in proteins to assess the thermodynamic properties of proteins and protein-ligand complexes. The quantitative proteomics platform involves the use of isobaric mass tags and a methionine-containing peptide enhancement strategy. The protocol is evaluated in a ligand binding experiment designed to identify the proteins in a yeast cell lysate that bind the well-known enzyme co-factor, β-nicotinamide adenine dinucleotide (NAD+). The protocol is also used to investigate the protein targets of resveratrol, a biologically active ligand with less well-understood protein targets. A known protein target of resveratrol, cytosolic aldehyde dehydrogenase, was identified, in addition to six other potential new protein targets, including four that are associated with the protein translation machinery, which has previously been implicated as a target of resveratrol.
Protein-ligand binding; resveratrol; NAD+; methionine oxidation; iTRAQ; SPROX
Analysis of tissues from cancers, precancers, and normal tissues provides a means to identify candidate markers for disease detection. The only proteomic technology platform capable of large-scale inventory and identification of serum proteins is shotgun proteomics, in which proteins are first digested to peptides and then the peptides are subjected to analysis by multidimensional liquid chromatography-tandem MS (LC-MS-MS). However, current implementations of shotgun proteome analyses are limited in both sample throughput and reproducibility in identification and detection, particularly for lower-abundance proteins. Here, we describe efforts to refine, standardize, and implement shotgun proteomics platforms for application to high-throughput analysis of clinical tissue specimens. Our guiding principles in developing and standardizing shotgun proteome analysis platforms, in order of decreasing priority, are to (1) achieve sufficient reproducibility to allow single analyses to replace multiple replicates, (2) reduce the amount of MS instrument time required for analysis, thus increasing throughput, and (3) achieve the greatest sensitivity and depth of coverage possible, with the ultimate goal of equaling or exceeding the performance of lower-throughput shotgun proteome analyses in current use. Refinement of the multidimensional LC-MS-MS platform is focused on (1) improving the reproducibility and standardization of peptide separations by replacing strong cation exchange separations with isoelectric focusing on immobilized pH gradient strips; (2) employing new methods to acquire MS-MS spectra in LC-MS-MS analyses using hybrid LTQ-Orbitrap instruments; (3) applying new data-analysis algorithms and software to identify peptides and proteins from MS-MS data and to quantify with label-free methods. A major challenge is the statistical comparison of multiple complex datasets derived by shotgun analyses to identify tissue-specific proteomic characteristics that can be selected as candidate markers.
The use of nLC-ESI-MS/MS in shotgun proteomics experiments and GeLC-MS/MS analysis is well accepted and routinely available in most proteomics laboratories. However, the same cannot be said for nLC-MALDI MS/MS, which has yet to experience such widespread acceptance, despite the fact that the MALDI technology offers several critical advantages over ESI. As an illustration, in an analysis of a moderately complex sample of E. coli proteins, the use of MALDI in addition to ESI in GeLC-MS/MS resulted in a 16% average increase in protein identifications, while with more complex samples the number of additional protein identifications increased by an average of 45%. The size of the unique peptides identified by MALDI was, on average, 25% larger than that of the unique peptides identified by ESI, and they were found to be slightly more hydrophilic. The insensitivity of MALDI to the presence of ionization suppression agents was shown to be a significant advantage, suggesting it be used as a complement to ESI when ion suppression is a possibility. Furthermore, the higher resolution of the TOF/TOF instrument improved the sensitivity, accuracy, and precision of the data over that obtained using only ESI-based iTRAQ experiments using a linear ion trap. Nevertheless, accurate data can be generated with either instrument. These results demonstrate that coupling nanoLC with both ESI and MALDI ionization interfaces improves proteome coverage, reduces the deleterious effects of ionization suppression agents, and improves quantitation, particularly in complex samples.
nLC-ESI-MS/MS; nLC-MALDI-MS/MS; protein identification; quantitation; quadrupole linear ion trap; tandem time-of-flight; mass spectrometry
Motivation: Liquid chromatography tandem mass spectrometry (LC-MS/MS) is the predominant method to comprehensively characterize complex protein mixtures such as samples from prefractionated or complete proteomes. In order to maximize proteome coverage for the studied sample, i.e., identify as many traceable proteins as possible, LC-MS/MS experiments are typically repeated extensively and the results combined. Proteome coverage prediction is the task of estimating the number of peptide discoveries of future LC-MS/MS experiments. Proteome coverage prediction is important to enhance the design of efficient proteomics studies. To date, no method exists to reliably estimate the increase of proteome coverage at an early stage.
Results: We propose an extended infinite Markov model DiriSim to extrapolate the progression of proteome coverage based on a small number of already performed LC-MS/MS experiments. The method explicitly accounts for the uncertainty of peptide identifications. We tested DiriSim on a set of 37 LC-MS/MS experiments of a complete proteome sample and demonstrated that DiriSim correctly predicts the coverage progression already from a small subset of experiments. The predicted progression enabled us to specify maximal coverage for the test sample. We demonstrated that quality requirements on the final proteome map impose an upper bound on the number of useful experiment repetitions and limit the achievable proteome coverage.
Contact: firstname.lastname@example.org; email@example.com
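As a hedged illustration of the prediction task only (DiriSim itself is an infinite Markov model; this naive alternative simply fits a saturating curve a·(1 − e^(−b·n)) to cumulative discoveries by grid search, over invented counts):

import math

def fit_saturation(cumulative):
    """cumulative: distinct-peptide counts after 1..n runs; crude grid search
    over the asymptote a and rate b of a*(1 - exp(-b*n))."""
    best = None
    step = max(1, cumulative[-1] // 50)
    for a in range(cumulative[-1], cumulative[-1] * 3, step):
        for b10 in range(1, 301):
            b = b10 / 100
            err = sum((a * (1 - math.exp(-b * (i + 1))) - c) ** 2
                      for i, c in enumerate(cumulative))
            if best is None or err < best[0]:
                best = (err, a, b)
    return best[1], best[2]

counts = [3200, 5100, 6300, 7100, 7650]  # toy cumulative discoveries
a, b = fit_saturation(counts)
print(f"predicted asymptotic coverage ~ {a} peptides")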
Identification of proteins by tandem mass spectrometry requires a database of the proteins that could be in the sample. This is available for model species (e.g. humans) but not for non-model species. Ideally, for a non-model species the sequencing of expressed mRNA would generate a protein database for mass spectrometry-based identification, allowing detection of genes and proteins using high-throughput sequencing and protein identification technologies. Here we use human cells infected with human adenovirus as a complex and dynamic model to demonstrate that this approach is robust. Our Proteomics Informed by Transcriptomics technique identifies >99% of over 3700 distinct proteins identified using traditional analysis reliant on comprehensive human and adenovirus protein lists. This facilitates high-throughput acquisition of direct evidence for transcripts and proteins in non-model species. Critically, we show that this approach can also be used to highlight genes and proteins undergoing dynamic changes in post-transcriptional protein stability.
Summary and recent advances
Mass spectrometry, specifically the analysis of complex peptide mixtures by liquid chromatography and tandem mass spectrometry (shotgun proteomics), has been at the center of proteomics research for the last decade. To overcome some of the fundamental limitations of the approach, including its limited sensitivity and high degree of redundancy, new proteomics workflows are being developed. Among these, targeting methods in which specific peptides are selectively isolated, identified and quantified are particularly promising. Here we summarize recent incremental advances in shotgun proteomics methods and outline emerging targeted workflows. The development of target-driven approaches, with their ability to detect and quantify identical, non-redundant sets of proteins in multiple repeat analyses, will be critically important for the application of proteomics to biomarker discovery and validation, and to systems biology research.
Objective: Analyze how precursor and fragment mass tolerance affect the number of true positives and false positives. Introduction: Mass spectrometry coupled to database searching is a powerful and popular protein identification tool. A typical shotgun proteomics experiment begins with degrading intact proteins into peptides. The peptide mixture then undergoes LC-MS/MS analysis, and the resulting experimental spectra are compared to theoretical spectra derived from protein, cDNA, or EST databases. Successful database searching is dependent on database size, post-translational modifications, and precursor and fragment ion m/z tolerance. Method: A standard protein set was made containing 62 verified T. cruzi recombinant proteins spiked into an E. coli lysate. This mixture was digested and then analyzed by LC-MS/MS using an LTQ-Orbitrap. Resulting spectra were searched against forward, reverse, and concatenated databases using Sequest, Mascot, and X!Tandem. Peptide probabilities were calculated using ProteinProphet, and peptide false discovery rates (FDRs) were calculated using ProteoIQ. It is necessary to use a standardized protein mixture to determine the number of true positives (T. cruzi proteins) and false positives (random proteins) found as a function of m/z search tolerance. Preliminary Results: At a 95% probability, more true positives are discovered as precursor ion mass accuracy is increased; however, more false positives are also discovered, and at a higher rate. For example, as mass accuracy is increased from ±1000 ppm to ±20 ppm, the number of spectra corresponding to true positives increases by 50% while the number of false positives increases by 380%. Using a 5% FDR filter with the same mass accuracy change yields a 37% increase in true positive matches, while leaving the number of false positives unchanged. Conclusions: FDR filtering can result in more successful data validation than probability filtering when performing high resolution mass spectrometry.
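A hedged sketch of the precursor-tolerance filter at the heart of this comparison (the candidate masses below are invented): tighter tolerance shrinks the candidate set each spectrum is scored against, which drives the trade-offs reported above.

def candidates_within(precursor_mz, peptide_mzs, tol_ppm):
    """Return candidate masses within +/- tol_ppm of the precursor m/z."""
    tol = precursor_mz * tol_ppm / 1e6
    return [mz for mz in peptide_mzs if abs(mz - precursor_mz) <= tol]

peptides = [820.0 + i * 0.01 for i in range(4000)]    # toy candidate masses
print(len(candidates_within(840.0, peptides, 1000)))  # wide: many candidates
print(len(candidates_within(840.0, peptides, 20)))    # narrow: few candidates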
Mass spectrometry (MS)-based shotgun proteomics allows protein identifications even in complex biological samples. Protein abundances can then be estimated from the counts of MS/MS spectra attributable to each protein, provided that one corrects for differential MS-detectability of the contributing peptides. We describe the use of a method, APEX, which calculates Absolute Protein EXpression levels based on learned correction factors, MS/MS spectral counts, and each protein's probability of correct identification.
The APEX-based calculations consist of three parts: (1) Using training data, peptide sequences and their sequence properties, a model is built that can be used to estimate MS-detectability (Oi) for any given protein. (2) Absolute abundances of proteins measured in an MS/MS experiment are calculated with information from spectral counts, identification probabilities and the learned Oi values. (3) Simple statistics allow for significance analysis of differential expression in two distinct biological samples, i.e., measuring relative protein abundances. APEX-based protein abundances span more than four orders of magnitude and are applicable to mixtures of hundreds to thousands of proteins from any type of organism.
Quantitative proteomics; Protein expression; Label-free mass spectrometry; Spectral counting
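A minimal sketch of the APEX calculation as described above (the protein names, counts and assumed total are illustrative): protein i's abundance is proportional to spectral count × identification probability / Oi, rescaled to an assumed total.

def apex_abundances(counts, probs, o_values, total_molecules=1e6):
    """counts: spectral counts; probs: identification probabilities;
    o_values: learned MS-detectabilities (Oi); returns per-protein abundance
    rescaled so the sample sums to total_molecules."""
    raw = {p: counts[p] * probs[p] / o_values[p] for p in counts}
    z = sum(raw.values())
    return {p: total_molecules * v / z for p, v in raw.items()}

counts = {"A": 120, "B": 30}
probs = {"A": 0.99, "B": 0.95}
o_vals = {"A": 2.0, "B": 0.5}  # B's peptides are poorly detectable
print(apex_abundances(counts, probs, o_vals))  # B nearly matches A after correction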
Signal transduction pathways that are modulated by thiol oxidation events are beginning to be uncovered, but these discoveries are limited by the availability of relatively few analytical methods to examine protein oxidation compared to other signaling events such as protein phosphorylation. We report here the coupling of PROP, a method to purify reversibly oxidized proteins, with the proteomic identification of the purified mixture using mass spectrometry. A gene ontology (GO), KEGG enrichment and Wikipathways analysis of the identified proteins indicated a significant enrichment in proteins associated with both translation and mRNA splicing. This methodology also enabled the identification of some of the specific cysteine residue targets within identified proteins that are reversibly oxidized by hydrogen peroxide treatment of intact cells. From these identifications, we determined a potential consensus sequence motif associated with oxidized cysteine residues. Furthermore, because we identified proteins and specific sites of oxidation from both abundant proteins and from far less abundant signaling proteins (e.g. hepatoma derived growth factor, prostaglandin E synthase 3), the results suggest that the PROP procedure was efficient. Thus, this PROP-proteomics methodology offers a sensitive means to identify biologically relevant redox signaling events that occur within intact cells.
A goal of proteomics is to distinguish between states of a biological system by identifying protein expression differences. Liu et al. demonstrated a method to perform semi-relative protein quantitation in shotgun proteomics data by correlating the number of tandem mass spectra obtained for each protein, or "spectral count", with its abundance in a mixture; however, two issues have remained open: how to normalize spectral counting data and how to efficiently pinpoint differences between profiles. Moreover, Chen et al. recently showed how to increase the number of identified proteins in shotgun proteomics by analyzing samples with different MS-compatible detergents while performing proteolytic digestion. The latter introduced new data analysis challenges, since replicate readings are not acquired.
To address the open issues above, we present a program termed PatternLab for proteomics. This program implements existing strategies and adds two new methods to pinpoint differences in protein profiles. The first method, ACFold, addresses experiments with less than three replicates from each state or having assays acquired by different protocols as described by Chen et al. ACFold uses a combined criterion based on expression fold changes, the AC test, and the false-discovery rate, and can supply a "bird's-eye view" of differentially expressed proteins. The other method addresses experimental designs having multiple readings from each state and is referred to as nSVM (natural support vector machine) because of its roots in evolutionary computing and in statistical learning theory. Our observations suggest that nSVM's niche comprises projects that select a minimum set of proteins for classification purposes; for example, the development of an early detection kit for a given pathology. We demonstrate the effectiveness of each method on experimental data and compare them with existing strategies.
PatternLab offers an easy and unified access to a variety of feature selection and normalization strategies, each having its own niche. Additionally, graphing tools are available to aid in the analysis of high throughput experimental data. PatternLab is available at .
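A hedged sketch of the normalization-plus-fold-change screen that such feature selection builds on (the AC test and nSVM themselves are not reproduced here; the counts are invented): row-normalize spectral counts per run, then compute log2 fold changes with a pseudocount so absent proteins remain comparable.

import math

def normalized_fold_changes(control, treatment, pseudocount=0.5):
    """control/treatment: {protein: spectral count} for one run each."""
    c_total, t_total = sum(control.values()), sum(treatment.values())
    changes = {}
    for prot in set(control) | set(treatment):
        c = (control.get(prot, 0) + pseudocount) / c_total
        t = (treatment.get(prot, 0) + pseudocount) / t_total
        changes[prot] = math.log2(t / c)  # >0 means up in treatment
    return changes

ctrl = {"P1": 40, "P2": 5, "P3": 0}
trt = {"P1": 38, "P2": 20, "P3": 6}
print(normalized_fold_changes(ctrl, trt))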