Motivation: Tandem mass spectrometry (MS/MS) offers fast and reliable characterization of complex protein mixtures, but suffers from low sensitivity in protein identification. In a typical shotgun proteomics experiment, it is assumed that all proteins are equally likely to be present. However, there is often other information available, e.g. the probability of a protein's presence is likely to correlate with its mRNA concentration.
Results: We develop a Bayesian score that estimates the posterior probability of a protein's presence in the sample given its identification in an MS/MS experiment and its mRNA concentration measured under similar experimental conditions. Our method, MSpresso, substantially increases the number of proteins identified in an MS/MS experiment at the same error rate, e.g. in yeast, MSpresso increases the number of proteins identified by ∼40%. We apply MSpresso to data from different MS/MS instruments, experimental conditions and organisms (Escherichia coli, human), and predict 19–63% more proteins across the different datasets. MSpresso demonstrates that incorporating prior knowledge of protein presence into shotgun proteomics experiments can substantially improve protein identification scores.
Availability and Implementation: Software is available upon request from the authors. Mass spectrometry datasets and supplementary information are available from http://www.marcottelab.org/MSpresso/.
Contact: email@example.com; firstname.lastname@example.org
Supplementary Information: Supplementary data website: http://www.marcottelab.org/MSpresso/.
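The Bayesian combination described above can be illustrated with a minimal sketch. Assuming (purely for illustration) that the MS/MS evidence and the mRNA evidence are conditionally independent given the protein's presence, an MS/MS-derived probability can be updated with mRNA likelihoods via Bayes' rule; the function name and inputs below are hypothetical, not the published MSpresso model:

```python
def mspresso_like_posterior(p_present_given_ms, p_mrna_given_present,
                            p_mrna_given_absent):
    """Combine an MS/MS-based presence probability with mRNA evidence via
    Bayes' rule, assuming the two lines of evidence are conditionally
    independent given presence (a simplifying assumption for illustration)."""
    prior = p_present_given_ms          # treat the MS/MS posterior as the new prior
    num = p_mrna_given_present * prior
    den = num + p_mrna_given_absent * (1.0 - prior)
    return num / den

# A protein with a borderline MS/MS score but strong mRNA support:
post = mspresso_like_posterior(0.5, 0.8, 0.2)  # -> 0.8
```

With a borderline MS/MS probability of 0.5, mRNA evidence that is four times more likely under presence than absence raises the posterior to 0.8, which is the qualitative effect the abstract describes.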
Shotgun proteomics has recently emerged as a powerful approach to characterizing proteomes in biological samples. Its overall objective is to identify the form and quantity of each protein in a high-throughput manner by coupling liquid chromatography with tandem mass spectrometry. As a consequence of its high-throughput nature, shotgun proteomics faces challenges with respect to the analysis and interpretation of experimental data. Among such challenges, the identification of proteins present in a sample has been recognized as an important computational task. This task generally consists of (1) assigning experimental tandem mass spectra to peptides derived from a protein database, and (2) mapping assigned peptides to proteins and quantifying the confidence of identified proteins. Protein identification is fundamentally a statistical inference problem, with a number of methods proposed to address its challenges. In this review, we categorize current approaches into rule-based, combinatorial optimization and probabilistic inference techniques, and present them using integer programming and Bayesian inference frameworks. We also discuss the main challenges of protein identification and propose potential solutions with the goal of spurring innovative research in this area.
Proteogenomics has the potential to advance genome annotation through high-quality peptide identifications derived from mass spectrometry experiments, which demonstrate that a given gene or isoform is expressed and translated at the protein level. This can advance our understanding of genome function, discovering novel genes and gene structures that have not yet been identified or validated. Because of the high-throughput shotgun nature of most proteomics experiments, it is essential to carefully control for false positives and prevent any potential misannotation. A number of statistical procedures to deal with this are in wide use in proteomics, calculating false discovery rate (FDR) and posterior error probability (PEP) values for groups of and individual peptide spectrum matches (PSMs). These methods control for multiple testing and exploit decoy databases to estimate statistical significance. Here, we show that database choice has a major effect on these confidence estimates, leading to significant differences in the number of PSMs reported. We note that standard target:decoy approaches using six-frame translations of nucleotide sequences, such as assembled transcriptome data, apparently underestimate the confidence assigned to the PSMs. The source of this error stems from the inflated and unusual nature of the six-frame database, in which for every target sequence there exist five “incorrect” targets that are unlikely to code for protein. The attendant FDR and PEP estimates lead to fewer accepted PSMs at fixed thresholds, and we show that this effect is a product of the database and statistical modeling and not the search engine. A variety of approaches to limit database size and remove noncoding target sequences are examined and discussed in terms of the altered statistical estimates generated and PSMs reported. These results are of importance to groups carrying out proteogenomics, aiming to maximize the validation and discovery of gene structure in sequenced genomes, while still controlling for false positives.
proteogenomics; peptide spectrum match; false discovery rate; posterior error probability; expressed
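The target:decoy confidence estimation discussed above reduces, in its simplest form, to counting decoy hits among accepted matches. The sketch below is illustrative only (function and variable names are ours); the abstract's point is that the composition of the database changes these counts and hence the estimates:

```python
def target_decoy_fdr(target_scores, decoy_scores, threshold):
    """Estimate FDR at a score threshold as (#decoy hits) / (#target hits),
    the basic target:decoy estimator. With an inflated six-frame database,
    the relative hit counts shift, which alters the resulting estimates."""
    n_targets = sum(s >= threshold for s in target_scores)
    n_decoys = sum(s >= threshold for s in decoy_scores)
    return n_decoys / n_targets if n_targets else 0.0

targets = [10, 20, 30, 40, 50, 60, 70, 80]
decoys = [5, 15, 25, 35]
fdr = target_decoy_fdr(targets, decoys, 25)  # 2 decoys / 6 targets
```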
Motivation: Complex patterns of protein phosphorylation mediate many cellular processes. Tandem mass spectrometry (MS/MS) is a powerful tool for identifying these post-translational modifications. In high-throughput experiments, mass spectrometry database search engines, such as MASCOT, provide a ranked list of peptide identifications based on hundreds of thousands of MS/MS spectra obtained in a mass spectrometry experiment. These search results are not in themselves sufficient for confident assignment of phosphorylation sites, as identification of characteristic mass differences requires time-consuming manual assessment of the spectra by an experienced analyst. The time required for manual assessment has previously rendered high-throughput confident assignment of phosphorylation sites challenging.
Results: We have developed a knowledge base of criteria, which replicate expert assessment, allowing more than half of cases to be automatically validated and site assignments verified with a high degree of confidence. This was assessed by comparing automated spectral interpretation with careful manual examination of the assignments for 501 peptides above the 1% false discovery rate (FDR) threshold corresponding to 259 putative phosphorylation sites in 74 proteins of the Trypanosoma brucei proteome. Despite this stringent approach, we are able to validate 80 of the 91 phosphorylation sites (88%) positively identified by manual examination of the spectra used for the MASCOT searches with a FDR < 15%.
Conclusions: High-throughput computational analysis can provide a viable second-stage validation of primary mass spectrometry database search results. Such validation gives rapid access to a systems-level overview of protein phosphorylation in the experiment under investigation.
Availability: A GPL-licensed software implementation in Perl for analysis and spectrum annotation is available in the supplementary material, and a web server can be accessed online at http://www.compbio.dundee.ac.uk/prophossi.
Supplementary information: Supplementary data are available at Bioinformatics online.
High-throughput shotgun proteomics data contain a significant number of spectra from non-peptide ions or spectra of too poor quality to obtain highly confident peptide identifications. These spectra cannot be identified with any positive peptide matches in some database search programs or are identified with false positives in others. Removing these spectra can improve the database search results and lower computational expense.
A new algorithm has been developed to filter tandem mass spectra of poor quality from shotgun proteomic experiments. The algorithm determines the noise level dynamically and independently for each spectrum in a tandem mass spectrometric data set. Spectra are filtered based on a minimum number of required signal peaks with a signal-to-noise ratio of 2. The algorithm was tested with 23 sample data sets containing 62,117 total spectra.
The spectral screening removed 89.0% of the tandem mass spectra that did not yield a peptide match when searched with the MassMatrix database search software. Only 6.0% of tandem mass spectra that yielded peptide matches considered to be true positive matches were lost after spectral screening. The algorithm was found to be very effective at removal of unidentified spectra in other database search programs including Mascot, OMSSA, and X!Tandem (75.93%-91.00%) with a small loss (3.59%-9.40%) of true positive matches.
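The filtering idea above can be sketched as follows. This is a simplified stand-in: the published algorithm's dynamic noise estimate is more sophisticated than the median used here, and the minimum peak count is a hypothetical default; only the signal-to-noise ratio of 2 comes from the text:

```python
def passes_quality_filter(intensities, min_signal_peaks=5, snr=2.0):
    """Keep a spectrum only if it has enough peaks above a per-spectrum
    noise level. The noise level here is a crude estimate (the median
    peak intensity); the published algorithm estimates it dynamically."""
    peaks = sorted(intensities)
    noise = peaks[len(peaks) // 2]          # crude per-spectrum noise estimate
    signal_peaks = sum(i >= snr * noise for i in intensities)
    return signal_peaks >= min_signal_peaks

good = passes_quality_filter([10] * 10 + [25] * 6)   # 6 peaks at S/N >= 2
bad = passes_quality_filter([10] * 10 + [25] * 3)    # only 3 such peaks
```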
Mass spectrometry-based proteomics is increasingly used to address basic and clinical questions in biomedical research through studies of differential protein expression, protein-protein interactions, and post-translational modifications. The complex structural and functional organization of the human brain warrants the application of high-throughput, systematic approaches to understand the functional alterations under normal physiological conditions and the perturbations of neurological diseases. This primer focuses on shotgun proteomics based tandem mass spectrometry for the identification of proteins in a complex mixture. It describes the basic concepts of protein differential expression analysis and post-translational modification analysis and discusses several strategies to improve the coverage of the proteome.
Tandem mass spectrometry-based shotgun proteomics has become a widespread technology for analyzing complex protein mixtures. A number of database searching algorithms have been developed to assign peptide sequences to tandem mass spectra. Assembling the peptide identifications to proteins, however, is a challenging issue because many peptides are shared among multiple proteins. IDPicker is an open-source protein assembly tool that derives a minimum protein list from peptide identifications filtered to a specified False Discovery Rate. Here, we update IDPicker to increase confident peptide identifications by combining multiple scores produced by database search tools. By segregating peptide identifications for thresholding using both the precursor charge state and the number of tryptic termini, IDPicker retrieves more peptides for protein assembly. The new version is more robust against false positive proteins, especially in searches using multispecies databases, by requiring additional novel peptides in the parsimony process. IDPicker has been designed for incorporation in many identification workflows by the addition of a graphical user interface and the ability to read identifications from the pepXML format. These advances position IDPicker for high peptide discrimination and reliable protein assembly in large-scale proteomics studies. The source code and binaries for the latest version of IDPicker are available from http://fenchurch.mc.vanderbilt.edu/.
bioinformatics; parsimony; protein assembly; protein inference; false discovery rate
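The parsimony step IDPicker performs is essentially a minimum set cover over shared peptides, commonly approximated greedily. The sketch below illustrates only the core idea; IDPicker's actual tie-breaking and novel-peptide requirements described above are richer:

```python
def parsimonious_proteins(peptides_to_proteins):
    """Greedy approximation of the minimal protein list that explains all
    identified peptides (a standard set-cover heuristic, not IDPicker's
    exact procedure)."""
    uncovered = set(peptides_to_proteins)
    # Invert the mapping: protein -> set of peptides it can explain.
    prot_peps = {}
    for pep, prots in peptides_to_proteins.items():
        for prot in prots:
            prot_peps.setdefault(prot, set()).add(pep)
    chosen = []
    while uncovered:
        # Pick the protein explaining the most still-uncovered peptides.
        best = max(prot_peps, key=lambda p: len(prot_peps[p] & uncovered))
        chosen.append(best)
        uncovered -= prot_peps[best]
    return chosen

# PEP2 and PEP3 are "degenerate" (shared); A and B suffice to cover all.
peps = {"PEP1": ["A"], "PEP2": ["A", "B"], "PEP3": ["B", "C"]}
minimal = parsimonious_proteins(peps)
```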
Shotgun proteomics commonly utilizes database search engines such as Mascot to identify proteins from tandem MS/MS spectra. False discovery rate (FDR) is often used to assess the confidence of peptide identifications. However, a widely accepted FDR of 1% sacrifices the sensitivity of peptide identification while improving the accuracy. This article details a machine learning approach combining a retention time-based support vector regressor (RT-SVR) with q value-based statistical analysis to improve peptide and protein identifications with high sensitivity and accuracy. The use of confident peptide identifications as training examples and careful feature selection ensures high R values (>0.900) for all models. The application of the RT-SVR model to Mascot results (p=0.10) increases the sensitivity of peptide identifications. The q value, as a function of the deviation between predicted and experimental RTs (ΔRT), is used to assess the significance of peptide identifications. We demonstrate that peptide and protein identifications increase by up to 89.4% and 83.5%, respectively, for a specified q value of 0.01 when applying the method to proteomic analysis of the natural killer leukemia cell line (NKL). This study establishes an effective methodology and provides a platform for profiling confident proteomes in more relevant species as well as a future investigation of accurate protein quantification.
tandem mass spectrometry; shotgun proteomics; database search; support vector regressor; retention time; q value; peptide identification; NKL cell
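Scoring peptides by retention-time deviation, as above, can be illustrated with a simple empirical p-value against a null distribution of decoy deviations. This is a stand-in for the RT-SVR plus q value pipeline (the SVR training itself is omitted), and all names are hypothetical:

```python
def rt_delta_pvalues(observed_rt, predicted_rt, decoy_deltas):
    """Empirical p-value for each peptide's retention-time deviation
    |observed - predicted|, using decoy deviations as the null: the
    smaller the deviation relative to decoys, the more significant."""
    n = len(decoy_deltas)
    pvals = []
    for obs, pred in zip(observed_rt, predicted_rt):
        delta = abs(obs - pred)
        hits = sum(d <= delta for d in decoy_deltas)
        pvals.append((hits + 1) / (n + 1))   # +1: conservative empirical p
    return pvals

decoy = [5.0, 8.0, 12.0, 20.0]         # |dRT| observed for decoy matches
p = rt_delta_pvalues([10.0], [9.5], decoy)   # dRT = 0.5, closer than any decoy
```

Converting such p-values into q values would then follow the usual FDR machinery.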
Shotgun proteomics has been used extensively for characterization of a number of proteomes. High-resolution Fourier transform mass spectrometry (FTMS) has emerged as a powerful tool owing to its high mass accuracy and resolving power. One of its major limitations, however, is that the confidence level of peptide identification and sensitivity cannot be maximized simultaneously. Although it is generally assumed that higher resolution is better for peptide identification, the precise effect of varying resolution as a parameter on peptide identification has not yet been systematically evaluated. We used the Escherichia coli proteome and a standard 48-protein mix to study the effect of different resolution parameters on peptide identifications in the setting of a shotgun proteomics experiment on an LTQ-Orbitrap mass spectrometer. We observed a higher number of peptide-spectrum matches (PSMs) whenever the MS scan was carried out by FT and the MS/MS in the ion trap (IT), with the maximum PSMs obtained at an MS resolution of 30,000. In contrast, when samples were analyzed by FT for both MS and MS/MS, the number of PSMs was significantly lower (~40% as compared to FT-IT experiments), with the maximum PSMs obtained when both the MS and MS/MS resolution were set to 15,000. Thus, a 15K-15K resolution setting may provide the best compromise for studies where both speed and accuracy are important, such as high-throughput post-translational modification analysis and de novo sequencing. We hope that our study will allow researchers to choose between different resolution parameters to achieve their desired results from proteomic analyses.
FTMS; duty cycle; E. coli proteome; PSM
Peptides are routinely identified from mass spectrometry-based proteomics experiments by matching observed spectra to peptides derived from protein databases. The error rates of these identifications can be estimated by target-decoy analysis, which involves matching spectra to shuffled or reversed peptides. Besides estimating error rates, decoy searches can be used by semi-supervised machine learning algorithms to increase the number of confidently identified peptides. As for all machine learning algorithms, however, the results must be validated to avoid issues such as overfitting or biased learning, which would produce unreliable peptide identifications. Here, we discuss how the target-decoy method is employed in machine learning for shotgun proteomics, focusing on how the results can be validated by cross-validation, a frequently used validation scheme in machine learning. We also use simulated data to demonstrate the proposed cross-validation scheme's ability to detect overfitting.
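The cross-validation scheme discussed above starts from a plain k-fold split of the PSMs, so that the rescoring model is never applied to a PSM it was trained on. A minimal sketch (round-robin fold assignment for determinism; real implementations typically randomize):

```python
def cross_validation_folds(psms, k=3):
    """Split PSMs into k folds and yield (train, test) pairs so that a
    semi-supervised rescoring model trained on k-1 folds is only ever
    applied to the held-out fold, avoiding overfitting in evaluation."""
    folds = [[] for _ in range(k)]
    for i, psm in enumerate(psms):
        folds[i % k].append(psm)     # round-robin assignment
    return [
        (sum(folds[:j] + folds[j + 1:], []), folds[j])  # (train, test)
        for j in range(k)
    ]

splits = cross_validation_folds(list(range(9)))  # 9 toy PSM identifiers
```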
Mass spectrometry-based protein identification methods are fundamental to proteomics. Biological experiments are usually performed in replicates and proteomic analyses generate huge datasets which need to be integrated and quantitatively analyzed. The Sequest™ search algorithm is a commonly used algorithm for identifying peptides and proteins from two dimensional liquid chromatography electrospray ionization tandem mass spectrometry (2-D LC ESI MS2) data. A number of proteomic pipelines that facilitate high throughput 'post data acquisition analysis' are described in the literature. However, these pipelines need to be updated to accommodate the rapidly evolving data analysis methods. Here, we describe a proteomic data analysis pipeline that specifically addresses two main issues pertinent to protein identification and differential expression analysis: 1) estimation of the probability of peptide and protein identifications and 2) non-parametric statistics for protein differential expression analysis. Our proteomic analysis workflow analyzes replicate datasets from a single experimental paradigm to generate a list of identified proteins with their probabilities and significant changes in protein expression using parametric and non-parametric statistics.
The input for our workflow is Bioworks™ 3.2 Sequest (or a later version, including cluster) output in XML format. We use a decoy database approach to assign probability to peptide identifications. The user has the option to select "quality thresholds" on peptide identifications based on the P value. We also estimate probability for protein identification. Proteins identified with peptides at a user-specified threshold value from biological experiments are grouped as either control or treatment for further analysis in ProtQuant. ProtQuant utilizes a parametric (ANOVA) method, for calculating differences in protein expression based on the quantitative measure ΣXcorr. Alternatively ProtQuant output can be further processed using non-parametric Monte-Carlo resampling statistics to calculate P values for differential expression. Correction for multiple testing of ANOVA and resampling P values is done using Benjamini and Hochberg's method. The results of these statistical analyses are then combined into a single output file containing a comprehensive protein list with probabilities and differential expression analysis, associated P values, and resampling statistics.
For biologists carrying out proteomics by mass spectrometry, our workflow facilitates automated, easy to use analyses of Bioworks (3.2 or later versions) data. All the methods used in the workflow are peer-reviewed and as such the results of our workflow are compliant with proteomic data submission guidelines to public proteomic data repositories including PRIDE. Our workflow is a necessary intermediate step that is required to link proteomics data to biological knowledge for generating testable hypotheses.
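The Benjamini and Hochberg correction applied in the pipeline above can be computed directly. This is the standard adjusted-p-value formulation, not code from the described workflow:

```python
def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values: each p-value is scaled by
    n/rank and monotonicity is enforced from the largest p-value down."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    adjusted = [0.0] * n
    prev = 1.0
    for rank in range(n, 0, -1):
        i = order[rank - 1]                       # index of rank-th smallest p
        prev = min(prev, pvals[i] * n / rank)     # enforce monotonicity
        adjusted[i] = prev
    return adjusted

adj = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
```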
In shotgun proteomics, a complex protein mixture is digested to peptides, separated and identified by microcapillary liquid chromatography followed by tandem mass spectrometry (LC-MS-MS). In this technology, complete protein digestion is often assumed. We show that, to the contrary, modifications to a standard digestion protocol demonstrate large, reproducible improvements in protein identification, a result consistent with digestion being a limiting factor in the efficiency of protein identification.
mass spectrometry; proteomics; digestion; protein identification
Methods for the global analysis of protein expression offer an approach to study the molecular basis of disease. Studies of protein expression in tissue, such as brain, are complicated by the need for efficient and unbiased digestion of proteins that permit identification of peptides by shotgun proteomic methods. In particular, identification and characterization of less abundant membrane proteins has been of great interest for studies of brain physiology, but often proteins of interest are of low abundance or exist in multiple isoforms. Parsing protein isoforms as a function of disease will be essential. In this study, we develop a digestion scheme using detergents compatible with mass spectrometry that improves membrane protein identification from brain tissue. We show the modified procedure yields close to 5,000 protein identifications from 1.8 mg of rat brain homogenate with an average of 25% protein sequence coverage. This procedure achieves a remarkable reduction in the amount of starting material required to observe a broad spectrum of membrane proteins. Among the proteins identified from a mammalian brain homogenate, 1897 (35%) proteins are annotated by GeneOntology as membrane proteins, and 1225 (22.6%) proteins are predicted to contain at least one transmembrane domain. Membrane proteins identified included neurotransmitter receptors and ion channels implicated in important physiological functions and disease.
brain proteome; mass spectrometry; membrane proteins; proteolysis; shotgun proteomics
The problem of identifying proteins from a shotgun proteomics experiment has not been definitively solved. Identifying the proteins in a sample requires ranking them, ideally with interpretable scores. In particular, “degenerate” peptides, which map to multiple proteins, have made such a ranking difficult to compute. The problem of computing posterior probabilities for the proteins, which can be interpreted as confidence in a protein’s presence, has been especially daunting. Previous approaches have either ignored the peptide degeneracy problem completely, addressed it by computing a heuristic set of proteins or heuristic posterior probabilities, or by estimating the posterior probabilities with sampling methods. We present a probabilistic model for protein identification in tandem mass spectrometry that recognizes peptide degeneracy. We then introduce graph-transforming algorithms that facilitate efficient computation of protein probabilities, even for large data sets. We evaluate our identification procedure on five different well-characterized data sets and demonstrate our ability to efficiently compute high-quality protein posteriors.
In shotgun proteomics, protein identification by tandem mass spectrometry relies on bioinformatics tools. Despite recent improvements in identification algorithms, a significant number of high quality spectra remain unidentified for various reasons. Here we present ScanRanker, an open-source tool that evaluates the quality of tandem mass spectra via sequence tagging with reliable performance in data from different instruments. The superior performance of ScanRanker enables it not only to find unassigned high quality spectra that evade identification through database search, but also to select spectra for de novo sequencing and cross-linking analysis. In addition, we demonstrate that the distribution of ScanRanker scores predicts the richness of identifiable spectra among multiple LC-MS/MS runs in an experiment, and ScanRanker scores assist the process of peptide assignment validation to increase confident spectrum identifications. The source code and executable versions of ScanRanker are available from http://fenchurch.mc.vanderbilt.edu.
spectral quality; sequence tagging; bioinformatics; tandem mass spectrometry; cross-linking
Crop plant yield safety is jeopardized by temperature stress caused by global climate change. To take countermeasures by breeding and/or transgenic approaches, it is essential to understand the mechanisms underlying plant acclimation to heat stress. To this end, proteomics approaches are most promising, as acclimation is largely mediated by proteins. Accordingly, several proteomics studies, mainly based on two-dimensional gel-tandem MS approaches, were conducted in the past. However, results were often inconsistent, presumably attributable to artifacts inherent to the display of complex proteomes via two-dimensional gels. We describe here a new approach to monitor proteome dynamics in time course experiments. This approach involves full 15N metabolic labeling and mass spectrometry-based quantitative shotgun proteomics using a uniform 15N standard over all time points. It comprises a software framework, IOMIQS, that features batch-job-mediated automated peptide identification by four parallelized search engines, peptide quantification and data assembly for the processing of large numbers of samples. We have applied this approach to monitor proteome dynamics in a heat stress time course using the unicellular green alga Chlamydomonas reinhardtii as a model system. We were able to identify 3433 Chlamydomonas proteins, of which 1116 were quantified in at least three of five time points of the time course. Statistical analyses revealed that levels of 38 proteins significantly increased, whereas levels of 206 proteins significantly decreased during heat stress. The increasing proteins comprise 25 (co-)chaperones and 13 proteins involved in chromatin remodeling, signal transduction, apoptosis, photosynthetic light reactions, and yet unknown functions.
Proteins decreasing during heat stress were significantly enriched in functional categories that mediate carbon flux from CO2 and external acetate into protein biosynthesis, which also correlated with a rapid, but fully reversible cell cycle arrest after onset of stress. Our approach opens up new perspectives for plant systems biology and provides novel insights into plant stress acclimation.
Motivation: Proteomics presents the opportunity to provide novel insights about the global biochemical state of a tissue. However, a significant problem with current methods is that shotgun proteomics has limited success at detecting many low abundance proteins, such as transcription factors from complex mixtures of cells and tissues. The ability to assay for these proteins in the context of the entire proteome would be useful in many areas of experimental biology.
Results: We used network-based inference in an approach named SNIPE (Software for Network Inference of Proteomics Experiments) that selectively highlights proteins that are more likely to be active but are otherwise undetectable in a shotgun proteomic sample. SNIPE integrates spectral counts from paired case–control samples over a network neighbourhood and assesses the statistical likelihood of enrichment by a permutation test. As an initial application, SNIPE was able to select several proteins required for early murine tooth development. Multiple lines of additional experimental evidence confirm that SNIPE can uncover previously unreported transcription factors in this system. We conclude that SNIPE can enhance the utility of shotgun proteomics data to facilitate the study of poorly detected proteins in complex mixtures.
Availability and Implementation: An implementation for the R statistical computing environment named snipeR has been made freely available at http://genetics.bwh.harvard.edu/snipe/.
Supplementary information: Supplementary data are available at Bioinformatics online.
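The permutation test at the core of SNIPE can be sketched by comparing a network neighbourhood's summed spectral counts against random protein sets of the same size. This simplification drops the paired case-control structure the abstract describes, and all names are ours:

```python
import random

def neighborhood_permutation_p(counts, neighborhood, n_perm=2000, seed=1):
    """Permutation p-value for the summed spectral counts over a network
    neighbourhood, against random protein sets of equal size (a simplified
    sketch of the SNIPE enrichment idea)."""
    rng = random.Random(seed)
    observed = sum(counts[p] for p in neighborhood)
    proteins = list(counts)
    k = len(neighborhood)
    hits = sum(
        sum(counts[p] for p in rng.sample(proteins, k)) >= observed
        for _ in range(n_perm)
    )
    return (hits + 1) / (n_perm + 1)

# Toy example: only {a, b} together reach the observed sum, so p is ~0.1
# (the chance of drawing exactly that pair from five proteins).
counts = {"a": 10, "b": 9, "c": 0, "d": 0, "e": 0}
p = neighborhood_permutation_p(counts, ["a", "b"])
```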
Determination of the proteome and identification of biomarkers are required to monitor dynamic changes in living organisms and predict the onset of an illness. One popular method to tackle contemporary proteomic samples is called shotgun proteomics, in which proteins are digested, the resulting peptides are separated by high-performance liquid chromatography (HPLC), and identification is performed with tandem mass spectrometry. Digestion of proteins typically leads to a very large number of peptides. For example, digestion of a cell lysate easily generates 500,000 peptides. The separation of these highly complex peptide samples is one of the major challenges in analytical chemistry. The main strategy to improve the efficiency of packed columns is either to increase column length or to decrease the size of the stationary phase particles. However, to operate these columns effectively, the LC conditions need to be adjusted accordingly. Naturally, the on-line coupling to MS systems has to be taken into account in the optimization process. Here, we report on the performance of nanoLC columns operating at ultra-high pressure. The effects of column parameters (particle size and column length) and LC conditions (gradient time, flow rate, column temperature) were investigated with reversed-phase (RP) gradient nanoLC. High-resolution LC-MS separations of complex proteomic peptide samples are demonstrated by combining long columns with 2 μm particles and long gradients. The effects of LC parameters on performance and the influence on peptide identification are discussed.
Most biological processes are governed by multiprotein complexes rather than individual proteins. Identification of protein complexes therefore is becoming increasingly important to gain a molecular understanding of cells and organisms. Mass spectrometry–based proteomics combined with affinity-tag-based protein purification is one of the most effective strategies to isolate and identify protein complexes. The development of tandem-affinity purification approaches has revolutionized proteomics experiments. These two-step affinity purification strategies allow rapid, effective purification of protein complexes and, at the same time, minimize background. Identification of even very low-abundant protein complexes with modern sensitive mass spectrometers has become routine. Here, we describe two general strategies for tandem-affinity purification followed by mass spectrometric identification of protein complexes.
tandem-affinity purification (TAP); His-Bio tag; mass spectrometry; protein complex identification; in-vivo cross-linking; MudPIT
Formalin-fixed paraffin-embedded (FFPE) tissue specimens comprise a potentially valuable resource for retrospective biomarker discovery studies, and recent work indicates the feasibility of using shotgun proteomics to characterize FFPE tissue proteins. A critical question in the field is whether proteomes characterized in FFPE specimens are equivalent to proteomes in corresponding fresh or frozen tissue specimens. Here we compared shotgun proteomic analyses of frozen and FFPE specimens prepared from the same colon adenoma tissues. Following deparaffinization, rehydration, and tryptic digestion under mild conditions, FFPE specimens corresponding to 200 μg of protein yielded ∼400 confident protein identifications in a one-dimensional reverse phase liquid chromatography-tandem mass spectrometry (LC-MS/MS) analysis. The major difference between frozen and FFPE proteomes was a decrease in the proportions of lysine C-terminal to arginine C-terminal peptides observed, but these differences had little effect on the proteins identified. No covalent peptide modifications attributable to formaldehyde chemistry were detected by analyses of the MS/MS datasets, which suggests that undetected, cross-linked peptides comprise the major class of modifications in FFPE tissues. Fixation of tissue for up to 2 days in neutral buffered formalin did not adversely impact protein identifications. Analysis of archival colon adenoma FFPE specimens indicated equivalent numbers of MS/MS spectral counts and protein group identifications from specimens stored for 1, 3, 5, and 10 years. Combination of peptide isoelectric focusing-based separation with reverse phase LC-MS/MS identified 2554 protein groups in 600 ng of protein from frozen tissue and 2302 protein groups from FFPE tissue with at least two distinct peptide identifications per protein. Analysis of the combined frozen and FFPE data showed a 92% overlap in the protein groups identified. 
Comparison of Gene Ontology categories of identified proteins revealed no bias in protein identification based on subcellular localization. Although the status of posttranslational modifications was not examined in this study, archival samples displayed a modest increase in methionine oxidation, from ∼17% after one year of storage to ∼25% after 10 years. These data demonstrate the equivalence of proteome inventories obtained from FFPE and frozen tissue specimens and provide support for retrospective proteomic analysis of FFPE tissues for biomarker discovery.
Motivation: Enrichment tests are used in high-throughput experimentation to measure the association between gene or protein expression and membership in groups or pathways. Fisher's exact test is commonly used. We specifically examined the associations produced by the Fisher test between protein identification by discovery mass spectrometry proteomics and Gene Ontology (GO) term assignments in a large yeast dataset. We found that direct application of the Fisher test is misleading in proteomics because mass spectrometry preferentially identifies proteins based on their biochemical properties. If this bias is not corrected, false inferences about associations can be made. Our method adjusts Fisher tests for these biases and produces associations more directly attributable to protein expression rather than experimental bias.
Results: Using logistic regression, we modeled the association between protein identification and GO term assignment while adjusting for identification bias in mass spectrometry. The model accounts for five biochemical properties of peptides: (i) hydrophobicity, (ii) molecular weight, (iii) transfer energy, (iv) beta turn frequency and (v) isoelectric point. The model was fit on 181 060 peptides from 2678 proteins identified in 24 yeast proteomics datasets at a 1% false discovery rate. In analyzing the association between protein identification and GO term assignment, we found that 25% (134 out of 544) of Fisher tests showing significant association (q-value ≤0.05) were non-significant after adjustment with our model. Simulations generating yeast protein sets enriched for identification propensity showed that unadjusted enrichment tests were biased, whereas our adjusted approach performed well.
Supplementary information: Supplementary data are available at Bioinformatics online.
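The unadjusted enrichment test discussed in the abstract above can be made concrete. Below is a minimal sketch of the one-sided Fisher's exact (hypergeometric tail) p-value as commonly applied to protein identification lists; the counts in the example are purely illustrative, and the paper's logistic-regression bias adjustment is not reproduced here.

```python
from math import comb

def fisher_enrichment_p(k, K, n, N):
    """One-sided Fisher's exact test p-value for enrichment:
    probability of observing >= k identified proteins annotated with
    a GO term, given that K proteins in the database carry the term,
    n proteins were identified, and the database holds N proteins."""
    upper = min(K, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)

# Illustrative counts: 5 of 20 identified proteins carry a term
# annotated on 40 of 1000 proteins in the database.
p = fisher_enrichment_p(5, 40, 20, 1000)
```

As the abstract notes, such raw p-values can be misleading when identification propensity (driven by peptide biochemistry) differs across GO categories, which is what motivates the covariate-adjusted model.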
In-depth analysis of the salivary proteome is fundamental to understanding the functions of salivary proteins in the oral cavity and to revealing disease biomarkers involved in different pathophysiological conditions, with the ultimate goal of improving patient diagnosis and prognosis. Submandibular and sublingual glands contribute saliva rich in glycoproteins to the total saliva output, making them valuable sources for glycoproteomic analysis. Lectin-affinity chromatography coupled to mass spectrometry-based shotgun proteomics was used to explore the submandibular/sublingual (SM/SL) saliva glycoproteome. A total of 262 N- and O-linked glycoproteins were identified by multidimensional protein identification technology (MudPIT). Only 38 were previously described in SM and SL salivas in the human salivary N-linked glycoproteome, while 224 were unique. Further comparison with the SM/SL saliva component of the human saliva proteome revealed 125 glycoproteins not previously reported in this secretion. KEGG pathway analyses demonstrated that many of these glycoproteins are involved in processes such as complement and coagulation cascades, cell communication, glycosphingolipid biosynthesis neo-lactoseries, O-glycan biosynthesis, glycan structures-biosynthesis 2, starch and sucrose metabolism, peptidoglycan biosynthesis and other pathways. In summary, lectin-affinity chromatography coupled to MudPIT mass spectrometry identified many novel glycoproteins in SM/SL saliva. These new additions to the salivary proteome may prove to be a critical step toward providing reliable biomarkers for the diagnosis of a myriad of oral and systemic diseases.
Submandibular/Sublingual saliva; MudPIT; lectin-affinity chromatography; glycoproteins; biomarkers
Tandem mass spectrometry has become a remarkably powerful technology for identifying proteins in proteomics. Bioinformatics tools, especially database searching tools, are essential for the interpretation of large quantities of proteomics data. Despite recent improvements in database searching algorithms, only a relatively small fraction of spectra can be confidently assigned to peptide sequences in a typical proteomics analysis. The remaining unassigned spectra often consist of low quality spectra that incur significant computational overhead but contribute little to protein identification. On the other hand, many high quality spectra remain unassigned due to modifications, mutations, and the deficiencies of the scoring methods implemented in database searching tools. Here we present ScanRanker, an open-source algorithm that offers a robust method for spectral quality assessment. Unlike existing tools, which require training for each type of instrument employed, ScanRanker evaluates the quality of tandem mass spectra via sequence tagging, providing reliable performance on datasets from different instruments. This performance enables ScanRanker not only to filter low quality spectra prior to database searching, but also to find unassigned high quality spectra that evade identification through database search.
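The sequence-tagging idea behind ScanRanker's quality metric can be illustrated with a toy sketch (this is not ScanRanker's actual algorithm, which is considerably more sophisticated and considers multiple ion types and peak pairs): scan gaps between adjacent fragment peaks for mass differences matching amino acid residue masses, and score the spectrum by the longest ladder found. The residue masses below are standard monoisotopic values; the tolerance is an arbitrary illustrative choice.

```python
# Monoisotopic residue masses (Da) for a few amino acids.
RESIDUE_MASSES = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203,
    "P": 97.05276, "V": 99.06841, "L": 113.08406,
}

def longest_tag(peaks, tol=0.02):
    """Length of the longest run of consecutive m/z gaps between
    sorted peaks that each match some residue mass within `tol`.
    High quality spectra tend to contain long such ladders."""
    peaks = sorted(peaks)
    best = run = 0
    for lo, hi in zip(peaks, peaks[1:]):
        gap = hi - lo
        if any(abs(gap - m) <= tol for m in RESIDUE_MASSES.values()):
            run += 1
            best = max(best, run)
        else:
            run = 0
    return best

# A fragment ladder spelling "GA" from m/z 200 yields a tag of length 2;
# the final gap (71.94 Da) matches no residue mass.
ladder = [200.0, 257.02146, 328.05857, 400.0]
```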
This manuscript provides a comprehensive review of the peptide and protein identification process using tandem mass spectrometry (MS/MS) data generated in shotgun proteomic experiments. The commonly used methods for assigning peptide sequences to MS/MS spectra are critically discussed and compared, from basic strategies to advanced multi-stage approaches. Particular attention is paid to the problem of false-positive identifications. Existing statistical approaches for assessing the significance of peptide to spectrum matches are surveyed, ranging from single-spectrum approaches such as expectation values to global error rate estimation procedures such as false discovery rates and posterior probabilities. The importance of using auxiliary discriminant information (mass accuracy, peptide separation coordinates, digestion properties, etc.) is discussed, and advanced computational approaches for joint modeling of multiple sources of information are presented. This review also includes a detailed analysis of the issues affecting the interpretation of data at the protein level, including the amplification of error rates when going from the peptide to the protein level, and the ambiguities in inferring the identities of sample proteins in the presence of shared peptides. Commonly used methods for computing protein-level confidence scores are discussed in detail. The review concludes with a discussion of several outstanding computational issues.
Proteomics; Bioinformatics; Mass Spectrometry; Peptide Identification; Protein Inference; Statistical Models; False Discovery Rates
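The global error-rate procedures surveyed in the review above (false discovery rates, q-values) are most often estimated with the target-decoy strategy. A minimal sketch follows, assuming a list of search hits scored so that higher is better and flagged as target or decoy; real implementations differ in details such as the decoy-count correction, and decoys/targets is only one common estimator of the FDR at a score threshold.

```python
def decoy_qvalues(hits):
    """hits: list of (score, is_decoy) pairs. Returns a q-value per
    hit in the original order. FDR at a score threshold is estimated
    as #decoys above threshold / #targets above threshold; the
    q-value is the minimum FDR over all thresholds admitting the hit."""
    order = sorted(range(len(hits)), key=lambda i: hits[i][0], reverse=True)
    fdrs = [0.0] * len(hits)
    targets = decoys = 0
    for rank, i in enumerate(order):
        if hits[i][1]:
            decoys += 1
        else:
            targets += 1
        fdrs[rank] = decoys / max(targets, 1)
    # Convert FDRs to q-values by taking the running minimum from the
    # most permissive threshold upward.
    qvals = [0.0] * len(hits)
    running = float("inf")
    for rank in range(len(order) - 1, -1, -1):
        running = min(running, fdrs[rank])
        qvals[order[rank]] = running
    return qvals

# Three targets and two decoys, interleaved by score.
qs = decoy_qvalues([(10, False), (9, False), (8, True), (7, False), (6, True)])
```

Mapping such peptide-level error rates to protein-level confidence is precisely where the error-rate amplification discussed in the review arises: many marginal peptide hits, each individually acceptable, can accumulate into a much larger protein-level error.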