Proteomics is the large-scale study of proteins, particularly their expression, structures and functions. This still-emerging combination of technologies aims to describe and characterize all expressed proteins in a biological system. Because of upper limits on mass detection of mass spectrometers, proteins are usually digested into peptides and the peptides are then separated, identified and quantified from this complex enzymatic digest. The problem in digesting proteins first and then analyzing the peptide cleavage fragments by mass spectrometry is that huge numbers of peptides are generated that overwhelm direct mass spectral analyses. The objective in the liquid chromatography approach to proteomics is to fractionate peptide mixtures to enable and maximize identification and quantification of the component peptides by mass spectrometry. This review will focus on existing multidimensional liquid chromatographic (MDLC) platforms developed for proteomics and their application in combination with other techniques such as stable isotope labeling. We also provide some perspectives on likely future developments.
multi-dimensional liquid chromatography; stable isotope labeling; label free; proteomics
A goal of proteomics is to distinguish between states of a biological system by identifying protein expression differences. Liu et al. demonstrated a method to perform semi-relative protein quantitation in shotgun proteomics data by correlating the number of tandem mass spectra obtained for each protein, or "spectral count", with its abundance in a mixture; however, two issues have remained open: how to normalize spectral counting data and how to efficiently pinpoint differences between profiles. Moreover, Chen et al. recently showed how to increase the number of identified proteins in shotgun proteomics by analyzing samples with different MS-compatible detergents while performing proteolytic digestion. This approach introduced new challenges from a data analysis perspective, since replicate readings are not acquired.
To address the open issues above, we present a program termed PatternLab for proteomics. This program implements existing strategies and adds two new methods to pinpoint differences in protein profiles. The first method, ACFold, addresses experiments with fewer than three replicates from each state or with assays acquired by different protocols, as described by Chen et al. ACFold uses a combined criterion based on expression fold changes, the AC test, and the false-discovery rate, and can supply a "bird's-eye view" of differentially expressed proteins. The other method addresses experimental designs having multiple readings from each state and is referred to as nSVM (natural support vector machine) because of its roots in evolutionary computing and in statistical learning theory. Our observations suggest that nSVM's niche comprises projects that select a minimum set of proteins for classification purposes; for example, the development of an early detection kit for a given pathology. We demonstrate the effectiveness of each method on experimental data and compare them with existing strategies.
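As a rough illustration of the combined criterion, the sketch below pairs a fold-change cutoff with an Audic-Claverie (AC) tail probability for a single protein. It is not PatternLab's implementation: the function names, the pseudocount, the one-sided tail, and the default thresholds are all assumptions, and the false-discovery-rate step is omitted.

```python
from math import exp, lgamma, log

def ac_prob(x, y, n1, n2):
    """Audic-Claverie probability of seeing y counts in run 2 given x counts in run 1.

    n1 and n2 are the total spectral counts of the two runs.
    """
    r = n2 / n1
    log_p = (y * log(r) + lgamma(x + y + 1) - lgamma(x + 1) - lgamma(y + 1)
             - (x + y + 1) * log(1 + r))
    return exp(log_p)

def ac_tail(x, y, n1, n2):
    """One-sided tail probability: the smaller of P(Y <= y) and P(Y >= y)."""
    lower = sum(ac_prob(x, k, n1, n2) for k in range(y + 1))
    upper = max(0.0, 1.0 - (lower - ac_prob(x, y, n1, n2)))
    return min(lower, upper)

def is_differential(x, y, n1, n2, min_fold=2.0, alpha=0.05):
    """Hypothetical combined rule: large fold change AND small AC tail probability."""
    # A pseudocount of 1 (an assumption) keeps the fold change finite for zero counts.
    fold = ((y + 1) / n2) / ((x + 1) / n1)
    large_change = fold >= min_fold or fold <= 1.0 / min_fold
    return large_change and ac_tail(x, y, n1, n2) <= alpha
```

Requiring both conditions keeps proteins with tiny counts (large fold change, weak statistical support) and proteins with huge counts (strong support, negligible fold change) out of the flagged set.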
PatternLab offers easy, unified access to a variety of feature selection and normalization strategies, each having its own niche. Additionally, graphing tools are available to aid in the analysis of high-throughput experimental data. PatternLab is available at .
In order to study the differential protein expression in complex biological samples, strategies for rapid, highly reproducible and accurate quantification are necessary. Isotope labeling and fluorescent labeling techniques have been widely used in quantitative proteomics research. However, researchers are increasingly turning to label-free shotgun proteomics techniques for faster, cleaner, and simpler results. Mass spectrometry-based label-free quantitative proteomics falls into two general categories. In the first are the measurements of changes in chromatographic ion intensity such as peptide peak areas or peak heights. The second is based on the spectral counting of identified proteins. In this paper, we will discuss the technologies of these label-free quantitative methods, statistics, available computational software, and their applications in complex proteomics studies.
Peptide identification via tandem mass spectrometry is the basic task of current proteomics research. Due to the complexity of mass spectra, the majority of mass spectra cannot be interpreted at present. The existence of unexpected or unknown protein post-translational modifications is a major reason.
This paper describes an efficient and sequence database-independent approach to detecting abundant post-translational modifications in high-accuracy peptide mass spectra. The approach is based on the observation that the spectra of a modified peptide and its unmodified counterpart are correlated with each other in their peptide masses and retention time. Frequently occurring peptide mass differences in a data set imply possible modifications, while small and consistent retention time differences provide orthogonal supporting evidence. We propose to use a bivariate Gaussian mixture model to discriminate modification-related spectral pairs from random ones. Due to the use of two-dimensional information, accurate modification masses and confident spectral pairs can be determined as well as the quantitative influences of modifications on peptide retention time.
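The first observation above (recurring precursor mass differences between spectra that elute close together) can be sketched as a simple binning step. This is only a simplified stand-in for the bivariate Gaussian mixture model described in the text; the function name, the 0.01 Da bin width, and the retention time window are assumptions chosen for illustration.

```python
from collections import Counter
from itertools import combinations

def frequent_mass_shifts(spectra, max_rt_diff=5.0, bin_width=0.01, min_count=2):
    """Count precursor mass differences between spectra with similar retention times.

    spectra: list of (precursor_mass_da, retention_time_min) tuples.
    Returns (approximate_shift_da, pair_count) tuples, most frequent first.
    """
    bins = Counter()
    for (m1, t1), (m2, t2) in combinations(spectra, 2):
        if abs(t1 - t2) > max_rt_diff:
            continue  # a modified peptide is expected to elute near its counterpart
        bins[round(abs(m1 - m2) / bin_width)] += 1
    hits = [(index * bin_width, n) for index, n in bins.items() if n >= min_count]
    return sorted(hits, key=lambda hit: -hit[1])

# Toy data (hypothetical): two peptides, each with a partner heavier by 15.9949 Da.
toy = [(1000.5000, 30.0), (1016.4949, 31.2),
       (1200.6000, 45.0), (1216.5949, 46.1),
       (1500.7000, 60.0)]
shifts = frequent_mass_shifts(toy)
```

In a real data set the surviving bins would be candidate modification masses, which the mixture model would then separate from random pairs.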
Experiments on two glycoprotein data sets demonstrate that our method can effectively detect abundant modifications and spectral pairs. By including the discovered modifications into database search or by propagating peptide assignments between paired spectra, an average of 10% more spectra are interpreted.
We describe Abacus, a computational tool for extracting spectral counts from tandem mass spectrometry based proteomic datasets. The program aggregates data from multiple experiments, adjusts spectral counts to accurately account for peptides shared across multiple proteins, and performs common normalization steps. It can also output the spectral count data at the gene level, thus simplifying the integration and comparison between gene and protein expression data. Abacus is compatible with the widely used Trans-Proteomic Pipeline suite of tools and comes with a graphical user interface making it easy to interact with the program. The main aim of Abacus is to streamline the analysis of spectral count data by providing an automated, easy to use solution for extracting this information from proteomic datasets for subsequent, more sophisticated statistical analysis.
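One common way to account for shared peptides is to split each shared peptide's spectra among its candidate proteins in proportion to the spectra those proteins receive from unique peptides. The sketch below illustrates that idea under stated assumptions; the weighting scheme, the function name, and the input layout are hypothetical and are not taken from the Abacus source.

```python
from collections import defaultdict

def adjusted_spectral_counts(peptide_counts, peptide_to_proteins):
    """Distribute shared-peptide spectra in proportion to unique-peptide evidence.

    peptide_counts: {peptide: spectral count}
    peptide_to_proteins: {peptide: list of proteins containing it}
    """
    unique = defaultdict(float)
    for peptide, n in peptide_counts.items():
        proteins = peptide_to_proteins[peptide]
        if len(proteins) == 1:
            unique[proteins[0]] += n

    adjusted = defaultdict(float, unique)
    for peptide, n in peptide_counts.items():
        proteins = peptide_to_proteins[peptide]
        if len(proteins) < 2:
            continue
        total_unique = sum(unique.get(p, 0.0) for p in proteins)
        for p in proteins:
            # Fall back to an even split when no candidate has unique evidence.
            weight = (unique.get(p, 0.0) / total_unique if total_unique > 0
                      else 1.0 / len(proteins))
            adjusted[p] += n * weight
    return dict(adjusted)

counts = adjusted_spectral_counts(
    {"pepA": 6, "pepB": 2, "pepShared": 4},
    {"pepA": ["P1"], "pepB": ["P2"], "pepShared": ["P1", "P2"]},
)
```

The total number of spectra is preserved, so downstream normalization steps see the same run totals as before the adjustment.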
Label free quantification; spectral counts; software; tandem mass spectrometry; protein inference; shared peptides
Quantification of protein expression by means of mass spectrometry (MS) has been introduced in various proteomics studies. In particular, two label-free quantification methods, spectral counting and spectral feature analysis, have been extensively investigated in a wide variety of proteomic studies. The cornerstone of both methods is peptide identification based on a proteomic database search and subsequent estimation of peptide retention time. However, they often suffer from restrictive database searches and inaccurate estimation of the liquid chromatography (LC) retention time. Furthermore, conventional peptide identification methods based on database or spectral library search algorithms, such as SEQUEST or SpectraST, have been found to provide neither the best match nor high-scored matches. Lastly, these methods are limited in the sense that target peptides cannot be identified unless they have been previously generated and stored in the database or spectral libraries.
To overcome these limitations, we propose a novel method, namely Quantification method based on Finding the Identical Spectral set for a Homogenous peptide (Q-FISH) to estimate the peptide's abundance from its tandem mass spectrometry (MS/MS) spectra through the direct comparison of experimental spectra. Intuitively, our Q-FISH method compares all possible pairs of experimental spectra in order to identify both known and novel proteins, significantly enhancing identification accuracy by grouping replicated spectra from the same peptide targets.
We applied Q-FISH to Nano-LC-MS/MS data obtained from human hepatocellular carcinoma (HCC) and normal liver tissue samples to identify differentially expressed peptides between the normal and disease samples. For a total of 44,318 spectra obtained through MS/MS analysis, Q-FISH yielded 14,747 clusters. Among these, 5,777 clusters were identified only in the HCC sample, 6,648 clusters only in the normal tissue sample, and 2,323 clusters in both the HCC and normal tissue samples. While it would be interesting to investigate peptide clusters found in only one sample, we further examined the spectral clusters identified in both the HCC and normal samples, since our goal is to identify and assess differentially expressed peptides quantitatively. The next step was to perform a beta-binomial test to isolate differentially expressed peptides between the HCC and normal tissue samples. This test resulted in 84 peptides with significantly differential spectral counts between the HCC and normal tissue samples. We independently identified 50 and 95 peptides by SEQUEST, of which 24 and 56 peptides, respectively, were found to be known biomarkers for human liver cancer. Comparing Q-FISH and SEQUEST results, we found that 22 of the 84 differentially expressed peptides found by Q-FISH were also identified by SEQUEST. Remarkably, of these 22 peptides discovered by both Q-FISH and SEQUEST, 13 peptides are known for human liver cancer and the remaining 9 peptides are known to be associated with other cancers.
We proposed a novel statistical method, Q-FISH, for accurately identifying protein species and simultaneously quantifying the expression levels of identified peptides from mass spectrometry data. Q-FISH analysis on human HCC and liver tissue samples identified many protein biomarkers that are highly relevant to HCC. Q-FISH can be a useful tool both for peptide identification and quantification on mass spectrometry data analysis. It may also prove to be more effective in discovering novel protein biomarkers than SEQUEST and other standard methods.
The fission yeast Schizosaccharomyces pombe is a widely used model organism to study basic mechanisms of eukaryotic biology, but unlike other model organisms, its proteome remains largely uncharacterized. Using a shotgun proteomics approach based on multidimensional prefractionation and tandem mass spectrometry, we have detected ∼30% of the theoretical fission yeast proteome. Applying statistical modelling to normalize spectral counts to the number of predicted tryptic peptides, we have performed label-free quantification of 1465 proteins. The fission yeast protein data showed considerable correlations with mRNA levels and with the abundance of orthologous proteins in budding yeast. Functional pathway analysis indicated that the mRNA–protein correlation is strong for proteins involved in signalling and metabolic processes, but increasingly discordant for components of protein complexes, which clustered in groups with similar mRNA–protein ratios. Self-organizing map clustering of large-scale protein and mRNA data from fission and budding yeast revealed coordinate but not always concordant expression of components of functional pathways and protein complexes. This finding reaffirms at the protein level the considerable divergence in gene expression patterns of the two model organisms that was noticed in previous transcriptomic studies.
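The idea of normalizing spectral counts to the number of predicted tryptic peptides can be illustrated with a plain in silico digest. The sketch below uses simple division rather than the statistical modelling applied in the study; the cleavage rule (after K or R, not before P), the 6 to 30 residue length window, and the function names are assumptions.

```python
import re

def tryptic_peptides(sequence, min_len=6, max_len=30):
    """In silico trypsin digest: cleave C-terminal to K or R unless followed by P."""
    pieces = re.split(r"(?<=[KR])(?!P)", sequence)
    # Keep only peptides in a length range plausibly observable by LC-MS/MS.
    return [p for p in pieces if min_len <= len(p) <= max_len]

def normalized_spectral_counts(counts, sequences):
    """Divide each protein's spectral count by its number of predicted peptides."""
    result = {}
    for protein, n in counts.items():
        n_predicted = len(tryptic_peptides(sequences[protein]))
        if n_predicted:  # skip proteins with no predicted observable peptide
            result[protein] = n / n_predicted
    return result

toy_sequence = "AAAAAAKPEPTIDERGGGGGGK"  # hypothetical sequence
peptides = tryptic_peptides(toy_sequence)
normalized = normalized_spectral_counts({"P1": 10}, {"P1": toy_sequence})
```

Without such a correction, long proteins that yield many detectable peptides would appear more abundant than short proteins present at the same copy number.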
fission yeast; LC-MS/MS; mRNA–protein correlation; relative protein quantification; protein profiling
The in vitro stationary phase proteome of the human pathogen Shigella dysenteriae serotype 1 (SD1) was quantitatively analyzed in Coomassie Blue G250 (CBB)-stained 2D gels. More than four hundred and fifty proteins, of which 271 were associated with distinct gel spots, were identified. In parallel, we employed 2D-LC-MS/MS followed by the label-free computationally modified spectral counting method APEX for absolute protein expression measurements. Of the 4502 genome-predicted SD1 proteins, 1148 proteins were identified with a false positive discovery rate of 5% and quantitated using 2D-LC-MS/MS and APEX. The dynamic range of the APEX method was approximately one order of magnitude higher than that of CBB-stained spot intensity quantitation. A squared Pearson correlation analysis revealed a reasonably good correlation (R² = 0.67) for protein quantities surveyed by both methods. The correlation was decreased for protein subsets with specific physicochemical properties, such as low Mr values and high hydropathy scores. Stoichiometric ratios of subunits of protein complexes characterized in E. coli were compared with APEX quantitative ratios of orthologous SD1 protein complexes. A high correlation was observed for subunits of soluble cellular protein complexes in several cases, demonstrating versatile applications of the APEX method in quantitative proteomics.
Spectral counting, a promising method for quantifying relative changes in protein abundance in mass spectrometry-based proteomic analysis, was compared to metabolic stable isotope labeling using 15N/14N “heavy/light” peptide pairs. The data were drawn primarily from a Methanococcus maripaludis experiment comparing a wild-type strain with a mutant deficient in a key enzyme relevant to energy metabolism. The dataset contained both proteome and transcriptome measurements. The normalization technique used previously for the isotopic measurements was inappropriate for spectral counting, but a simple adjustment for sampling frequency was sufficient for normalization. This adjustment was satisfactory both for M. maripaludis, an organism that showed relatively little expression change between the wild-type and mutant strains, and Porphyromonas gingivalis, an intracellular pathogen that has demonstrated widespread changes between intracellular and extracellular conditions. Spectral counting showed lower overall sensitivity defined in terms of detecting a two-fold change in protein expression, and in order to achieve the same level of quantitative proteome coverage as the stable isotope method, it would have required approximately doubling the number of mass spectra collected.
The analysis of tandem mass (MS/MS) data to identify and quantify proteins is hampered by the heterogeneity of file formats at the raw spectral data, peptide identification, and protein identification levels. Different mass spectrometers output their raw spectral data in a variety of proprietary formats, and alternative methods that assign peptides to MS/MS spectra and infer protein identifications from those peptide assignments each write their results in different formats. Here we describe an MS/MS analysis platform, the Trans-Proteomic Pipeline, which makes use of open XML file formats for storage of data at the raw spectral data, peptide, and protein levels. This platform enables uniform analysis and exchange of MS/MS data generated from a variety of different instruments, and assigned peptides using a variety of different database search programs. We demonstrate this by applying the pipeline to data sets generated by ThermoFinnigan LCQ, ABI 4700 MALDI-TOF/TOF, and Waters Q-TOF instruments, and searched in turn using SEQUEST, Mascot, and COMET.
analysis platform; open XML; proteomics
A key problem in computational proteomics is distinguishing between correct and false peptide identifications. We argue that evaluating the error rates of peptide identifications is not unlike computing generating functions in combinatorics. We show that the generating functions and their derivatives (spectral energy and spectral probability) represent new features of tandem mass spectra that, similarly to Δ-scores, significantly improve peptide identifications. Furthermore, the spectral probability provides a rigorous solution to the problem of computing statistical significance of spectral identifications. The spectral energy/probability approach improves the sensitivity-specificity trade-off of existing MS/MS search tools, addresses the notoriously difficult problem of “one-hit-wonders” in mass spectrometry, and often eliminates the need for decoy database searches. We therefore argue that the generating function approach has the potential to increase the number of peptide identifications in MS/MS searches.
For proteomic analysis using tandem mass spectrometry, linear ion trap instruments provide unsurpassed sensitivity, but unreliably detect low mass peptide fragments, precluding their use with iTRAQ reagent labeled samples. While the popular LTQ linear ion trap supports analyzing iTRAQ reagent labeled peptides via pulsed Q dissociation, PQD, its effectiveness remains questionable. Using a standard mixture, we found careful tuning of relative collision energy necessary for fragmenting iTRAQ reagent labeled peptides, and increasing microscan acquisition and repeat count improves quantification, but identifies somewhat fewer peptides. We developed software to calculate abundance ratios via summing reporter ion intensities across spectra matching to each protein, thereby providing maximized accuracy. Testing found results closely corresponded between analysis using optimized LTQ-PQD settings plus our software and using a Qstar instrument. Thus, we demonstrate the effectiveness of LTQ-PQD analyzing iTRAQ reagent labeled peptides, and provide guidelines for successful quantitative proteomic studies.
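The ratio calculation described above, summing reporter ion intensities across all spectra matched to a protein before dividing, can be sketched as follows. The data layout, the function name, and the choice of the 114 and 115 channels are assumptions for illustration and do not reproduce the authors' software.

```python
from collections import defaultdict

def protein_ratios(spectra, numerator="115", denominator="114"):
    """Abundance ratio per protein from summed reporter ion intensities.

    spectra: list of (protein, {reporter_channel: intensity}) tuples.
    """
    sums = defaultdict(lambda: defaultdict(float))
    for protein, reporters in spectra:
        for channel, intensity in reporters.items():
            sums[protein][channel] += intensity
    return {protein: channels[numerator] / channels[denominator]
            for protein, channels in sums.items()
            if channels[denominator] > 0}

ratios = protein_ratios([
    ("P1", {"114": 100.0, "115": 210.0}),  # intense spectrum, ratio 2.1
    ("P1", {"114": 10.0, "115": 10.0}),    # weak spectrum, ratio 1.0
    ("P2", {"114": 50.0, "115": 25.0}),
])
```

Summing first weights each spectrum by its intensity, so weak spectra with noisy reporter ions pull the protein ratio less than they would if per-spectrum ratios were averaged (1.55 for P1 in this toy example, against 2.0 from the sums).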
quantitative proteomics; iTRAQ; linear ion trap; pulsed-Q-dissociation
Shotgun proteomics provides the most powerful analytical platform for global inventory of complex proteomes using liquid chromatography−tandem mass spectrometry (LC−MS/MS) and allows a global analysis of protein changes. Nevertheless, sampling of complex proteomes by current shotgun proteomics platforms is incomplete, and this contributes to variability in assessment of peptide and protein inventories by spectral counting approaches. Thus, shotgun proteomics data pose challenges in comparing proteomes from different biological states. We developed an analysis strategy using quasi-likelihood Generalized Linear Modeling (GLM), included in a graphical interface software package (QuasiTel) that reads standard output from protein assemblies created by IDPicker, an HTML-based user interface to query shotgun proteomic data sets. This approach was compared to four other statistical analysis strategies: Student's t test, Wilcoxon rank test, Fisher's Exact test, and Poisson-based GLM. We analyzed the performance of these tests to identify differences in protein levels based on spectral counts in a shotgun data set in which equimolar amounts of 48 human proteins were spiked at different levels into whole yeast lysates. Both GLM approaches and the Fisher Exact test performed adequately, each with its own limitations. We subsequently compared the proteomes of normal tonsil epithelium and HNSCC using this approach and identified 86 proteins with differential spectral counts between normal tonsil epithelium and HNSCC. We selected 18 proteins from this comparison for verification of protein levels between the individual normal and tumor tissues using liquid chromatography−multiple reaction monitoring mass spectrometry (LC−MRM-MS). This analysis confirmed the magnitude and direction of the protein expression differences in all 6 proteins for which reliable data could be obtained.
Our analysis demonstrates that shotgun proteomic data sets from different tissue phenotypes are sufficiently rich in quantitative information and that statistically significant differences in protein spectral counts reflect the underlying biology of the samples.
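Of the tests compared above, Fisher's exact test is the simplest to write down for spectral counts. The sketch below is a minimal two-sided version for one protein; the 2x2 layout (this protein's counts versus all remaining counts, in each of two states) is an assumed convention, and this is not the QuasiTel implementation.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    a: spectral counts of the protein in state 1, b: all other counts in state 1,
    c: spectral counts of the protein in state 2, d: all other counts in state 2.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def pmf(k):
        # Hypergeometric probability of k protein counts in state 1, margins fixed.
        return comb(row1, k) * comb(row2, col1 - k) / comb(total, col1)

    p_observed = pmf(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum all tables that are no more probable than the observed one.
    p_value = sum(pmf(k) for k in range(lowest, highest + 1)
                  if pmf(k) <= p_observed * (1 + 1e-9))
    return min(1.0, p_value)
```

Because the test pools counts within each state, it ignores replicate-to-replicate variability, which is one reason model-based approaches such as the quasi-likelihood GLM are attractive when replicates are available.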
Shotgun proteomics provides the most powerful analytical platform for global inventory of complex proteomes but incomplete sampling poses challenges in comparing protein inventories by spectral counting approaches. We developed a statistical method based on quasi-likelihood modeling and demonstrate that it compares favorably to other statistical tests. Statistically significant spectral count differences were confirmed by MRM demonstrating that the observed protein level differences reflect the underlying biology of the samples.
LC−MS/MS; shotgun proteomics; multiple reaction monitoring (MRM); head and neck carcinoma; Generalized Linear Model; spectral counting
Spectral counting is a strategy to quantitate relative protein concentrations in pre-digested protein mixtures analyzed by liquid chromatography online with tandem mass spectrometry. In this work we used combinations of normalization and statistical (feature selection) methods on spectral counting data to verify whether we could pinpoint which and how many proteins were differentially expressed when comparing complex protein mixtures. These combinations were evaluated on real, but controlled, experiments (protein markers were spiked into yeast lysates in different concentrations to simulate differences), which are therefore verifiable. The following normalization methods were applied: total signal, Z-normalization, hybrid normalization, and log preprocessing. The feature selection methods were: Golub's index, Student's t-test, a strategy based on the weighting used in a support vector machine model (SVM-F), and support vector machine recursive feature elimination. The results showed that Z-normalization combined with SVM-F correctly identified which and how many protein markers were added to the yeast lysates for all different concentrations. The software we used is available at http://pcarvalho.com/patternlab.
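Two of the normalization methods and one of the feature selection scores named above are short enough to sketch directly. The versions below are assumptions about the usual definitions (total signal as division by the run total, Z-normalization as centring and scaling within a run, Golub's index as the difference of class means over the sum of class standard deviations) and are not taken from the PatternLab source.

```python
from statistics import mean, pstdev, stdev

def total_signal(run_counts):
    """Divide every spectral count in a run by the run's total count."""
    total = sum(run_counts)
    return [c / total for c in run_counts]

def z_normalize(run_counts):
    """Centre a run on zero mean and scale it to unit (population) standard deviation."""
    mu, sigma = mean(run_counts), pstdev(run_counts)
    return [(c - mu) / sigma for c in run_counts]

def golub_index(class_a, class_b):
    """Signal-to-noise score for one protein measured in two classes of runs."""
    return (mean(class_a) - mean(class_b)) / (stdev(class_a) + stdev(class_b))
```

For example, a protein observed at 10, 12, and 14 counts in one state and 2, 4, and 6 counts in the other has class means of 12 and 4 and sample standard deviations of 2 each, giving a Golub index of 2.0.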
MudPIT; feature selection; SVM; spectral counting; feature ranking
Differential analysis of whole cell proteomes by mass spectrometry has largely been applied using various forms of stable isotope labeling. While metabolic stable isotope labeling has been the method of choice, it is often not possible to apply such an approach. Four different label free ways of calculating expression ratios in a classic “two-state” experiment are compared: signal intensity at the peptide level, signal intensity at the protein level, spectral counting at the peptide level, and spectral counting at the protein level. The quantitative data were mined from a dataset of 1245 qualitatively identified proteins, about 56% of the protein encoding open reading frames from Porphyromonas gingivalis, a Gram-negative intracellular pathogen being studied under extracellular and intracellular conditions. Two different control populations were compared against P. gingivalis internalized within a model human target cell line. The q-value statistic, a measure of false discovery rate previously applied to transcription microarrays, was applied to proteomics data. For spectral counting, the most logically consistent estimate of random error came from applying the locally weighted scatter plot smoothing procedure (LOWESS) to the most extreme ratios generated from a control technical replicate, thus setting upper and lower bounds for the region of experimentally observed random error.
spectral count; Porphyromonas gingivalis; q-value; quantitative proteomics; G test
The field of proteomics involves the characterization of the peptides and proteins expressed in a cell under specific conditions. Proteomics has made rapid advances in recent years following the sequencing of the genomes of an increasing number of organisms. A prominent technology for high throughput proteomics analysis is the use of liquid chromatography coupled to Fourier transform ion cyclotron resonance mass spectrometry (LC-FTICR-MS). Meaningful biological conclusions can best be made when the peptide identities returned by this technique are accompanied by measures of accuracy and confidence.
After a tryptically digested protein mixture is analyzed by LC-FTICR-MS, the observed masses and normalized elution times of the detected features are statistically matched to the theoretical masses and elution times of known peptides listed in a large database. The probability of matching is estimated for each peptide in the reference database using statistical classification methods assuming bivariate Gaussian probability distributions on the uncertainties in the masses and the normalized elution times.
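The matching step can be illustrated by scoring an observed feature against each candidate peptide with a bivariate Gaussian density over the mass error and the normalized elution time (NET) error, then normalizing the scores into probabilities. The sketch assumes independent errors (a diagonal covariance), hypothetical tolerances of 2 ppm and 0.02 NET units, and optional a priori probabilities; none of these values come from the study.

```python
from math import exp, pi

def match_posteriors(obs_mass, obs_net, candidates,
                     sigma_ppm=2.0, sigma_net=0.02, priors=None):
    """Posterior probability that an observed feature matches each candidate.

    candidates: list of (name, theoretical_mass, theoretical_net) tuples.
    priors: optional {name: a priori probability}; equal priors if omitted.
    """
    weighted = {}
    for name, mass, net in candidates:
        mass_error_ppm = (obs_mass - mass) / mass * 1e6
        net_error = obs_net - net
        density = exp(-0.5 * ((mass_error_ppm / sigma_ppm) ** 2
                              + (net_error / sigma_net) ** 2))
        density /= 2 * pi * sigma_ppm * sigma_net
        prior = 1.0 if priors is None else priors.get(name, 0.0)
        weighted[name] = prior * density
    total = sum(weighted.values())
    if total == 0.0:
        return {name: 0.0 for name in weighted}  # nothing is a plausible match
    return {name: w / total for name, w in weighted.items()}

posteriors = match_posteriors(
    1000.0, 0.30,
    [("pepA", 1000.0, 0.30), ("pepB", 1000.01, 0.30)],  # pepB is about 10 ppm away
)
```

Supplying unequal priors is one way to express the contrast between peptides from expected and unexpected proteins.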
A database of 69,220 features from 32 LC-FTICR-MS analyses of a tryptically digested bovine serum albumin (BSA) sample was matched to a database populated with 97% false positive peptides. The percentage of high confidence identifications was found to be consistent with other database search procedures. BSA database peptides were identified with high confidence on average in 14.1 of the 32 analyses. False positives were identified on average in just 2.7 analyses.
Using a priori probabilities that contrast peptides from expected and unexpected proteins was shown to perform better in identifying target peptides than using equally likely a priori probabilities. This is because a large percentage of the target peptides were similar to unexpected peptides which were included to be false positives. The use of triplicate analyses with a "2 out of 3" reporting rule was shown to have excellent rejection of false positives.
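The effect of a "2 out of 3" reporting rule can be estimated with a short calculation. The sketch below treats the average detection frequencies reported above (14.1 of 32 analyses for BSA peptides, 2.7 of 32 for false positives) as per-analysis probabilities and assumes the three analyses are independent; both are simplifying assumptions made for illustration.

```python
def at_least_2_of_3(p):
    """Probability that an event with per-analysis probability p occurs in >= 2 of 3 runs."""
    return 3 * p ** 2 * (1 - p) + p ** 3

true_rate = 14.1 / 32   # average per-analysis rate for BSA peptides
false_rate = 2.7 / 32   # average per-analysis rate for false positives

reported_true = at_least_2_of_3(true_rate)
reported_false = at_least_2_of_3(false_rate)
```

Because the rule scales roughly with the square of a small probability, it suppresses rare false matches far more strongly than it suppresses frequently observed true peptides.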
One of the most popular methods to prepare tryptic peptides for bottom-up proteomic analysis is in-gel digestion. To date, there have been few studies comparing various digestion methods. In this study, we compare the efficiency of several popular in-gel digestion methods, along with new technologies that may improve digestion efficiency, using a human epidermoid carcinoma cell lysate protein standard. The efficiency of each protocol was based on the average number of proteins identified and their respective sequence coverage and relative quantitation using spectral counting. The importance of this study lies in its comparison of pre-existing in-gel digestion methods with those that use newly developed technologies that may introduce the potential for a more cost-effective digestion, higher protein yield, and an overall reduction in processing time. The following four protocols were compared: an overnight in-gel digestion protocol; an overnight in-gel digestion protocol, in which we remove the vacuum centrifugation steps; in-gel digestion in a barometric pressure cycler; and in-gel digestion in a scientific microwave. Several variables were tested for increased digestion efficiency and decreased keratin contamination. Statistical analysis was performed on replicate samples to determine significant differences between protocols.
mass spectrometry; proteomics
One of the most popular methods to prepare tryptic peptides for bottom-up proteomic analysis is in-gel digestion. To date, there have been few studies comparing various digestion protocols. In this study, we compare the efficiency of several popular in-gel digestion protocols along with new pieces of technology that may improve digestion efficiency, using a human epidermoid carcinoma cell lysate protein standard. The efficiency of each protocol will be based on the number of proteins identified, their respective sequence coverage, and relative quantitation using spectral counting. The importance of this study lies in its comparison of pre-existing in-gel digestion methods and newly developed technologies. These new technologies introduce the potential for a more cost-effective digestion, higher protein yield, and an overall reduction in time. The following four protocols will be compared: Shevchenko's overnight protocol (Methods in Molecular Biology 1999;122:383-397), in-gel digestion in a barometric pressure cycler (Pressure Biosciences, Boston, MA), and in-gel digestion in a scientific microwave (CEM, Mathews, NC). In addition, several variables will be tested for increased digestion efficiency and keratin contamination, including the elimination of vacuum centrifugation and the use of modified and non-modified trypsin. Statistical analysis will be performed on replicate samples to determine if there are any significant differences between protocols.
Plasma biomarker studies are based on the differential expression of proteins between different treatment groups or between diseased and control populations. Most mass spectrometry-based methods of protein quantitation, however, are based on the detection and quantitation of peptides, not intact proteins. For peptide-based protein quantitation to be accurate, the digestion protocols used in proteomic analyses must be both efficient and reproducible. There have been very few studies, however, where plasma denaturation/digestion protocols have been compared using absolute quantitation methods. In this paper, 14 combinations of heat, solvent [acetonitrile, methanol, trifluoroethanol], chaotropic agents [guanidine hydrochloride, urea], and surfactants [sodium dodecyl sulfate (SDS) and sodium deoxycholate (DOC)] were compared with respect to their effectiveness in improving subsequent tryptic digestion. These digestion protocols were evaluated by quantitating the production of proteotypic tryptic peptides from 45 moderate- to high-abundance plasma proteins, using tandem mass spectrometry in multiple reaction monitoring mode, with a mixture of stable-isotope labeled analogues of these proteotypic peptides as internal standards. When the digestion efficiencies of these 14 methods were compared, we found that both of the surfactants (SDS and DOC) produced an increase in the overall yield of tryptic peptides from these 45 proteins, when compared to the more commonly used urea protocol. SDS, however, can be a serious interference for subsequent mass spectrometry. DOC, on the other hand, can be easily removed from the samples by acid precipitation. In a reproducibility study done with 5 replicate digestions, DOC and SDS with a 9 h digestion time produced the highest average digestion efficiencies (~80%), with the highest average reproducibility (<5% error, defined as the relative deviation from the mean value).
However, because of potential interferences resulting from the use of SDS, we recommend DOC with a 9 h digestion procedure as the optimum protocol.
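The two quantities used to rank the protocols can be sketched in a few lines: the amount of a proteotypic peptide inferred from its stable-isotope labeled internal standard, and the reproducibility error expressed as relative deviation from the mean. The function names, the use of peak areas, and the fmol unit are assumptions for illustration.

```python
from statistics import mean

def endogenous_amount(light_area, heavy_area, spiked_fmol):
    """Peptide amount from the light/heavy peak area ratio and the spiked standard."""
    return light_area / heavy_area * spiked_fmol

def mean_relative_deviation(values):
    """Average absolute deviation from the mean, as a fraction of the mean."""
    centre = mean(values)
    return mean([abs(v - centre) / centre for v in values])

amount = endogenous_amount(light_area=200.0, heavy_area=100.0, spiked_fmol=5.0)
error = mean_relative_deviation([9.0, 10.0, 11.0])  # replicate digestions
```

Dividing the measured amount by the amount expected from the known protein concentration would then give a digestion efficiency for that peptide.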
protein digestion; deoxycholate; urea; sodium dodecyl sulfate; heat denaturation; solvent denaturation
Immunoaffinity depletion with antibodies to the top 7 or top 14 high abundance plasma proteins is used to enhance detection of lower abundance proteins in both shotgun and targeted proteomic analyses. We evaluated the effects of top 7/top 14 immunodepletion on the shotgun proteomic analysis of human plasma. Our goal was to evaluate the impact of immunodepletion on detection of proteins across detectable ranges of abundance. The depletion columns afforded highly repeatable and efficient plasma protein fractionation. Relatively few nontargeted proteins were captured by the depletion columns. Analyses of unfractionated and immunodepleted plasma by peptide isoelectric focusing (IEF), followed by liquid chromatography-tandem mass spectrometry (LC-MS/MS) demonstrated enrichment of nontargeted plasma proteins by an average of 4-fold, as assessed by MS/MS spectral counting. Either top 7 or top 14 immunodepletion resulted in a 25% increase in identified proteins compared to unfractionated plasma. Although 23 low abundance (<10 ng mL−1) plasma proteins were detected, they accounted for only 5–6% of total protein identifications in immunodepleted plasma. In both unfractionated and immunodepleted plasma, the 50 most abundant plasma proteins accounted for 90% of cumulative spectral counts and precursor ion intensities, leaving little capacity to sample lower abundance proteins. Untargeted proteomic analyses using current LC-MS/MS platforms—even with immunodepletion—cannot be expected to efficiently discover low abundance, disease-specific biomarkers in plasma.
plasma; high-abundance protein depletion; multiple affinity removal system; isoelectric focusing; shotgun proteomics
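Two quantities used in the immunodepletion abstract above, the cumulative share of spectral counts taken by the most abundant proteins and the fold enrichment of a nontargeted protein after depletion, can be computed as in the following sketch. The protein labels and counts are invented for illustration, and the function names are assumptions rather than part of any published tool.

```python
def top_n_share(spectral_counts, n):
    """Fraction of all spectral counts contributed by the n highest-count proteins."""
    ranked = sorted(spectral_counts.values(), reverse=True)
    total = sum(ranked)
    return sum(ranked[:n]) / total if total else 0.0

def fold_enrichment(depleted_counts, unfractionated_counts, protein):
    """Ratio of a protein's spectral count after depletion to its count before."""
    before = unfractionated_counts.get(protein, 0)
    after = depleted_counts.get(protein, 0)
    return after / before if before else float("inf")

# Hypothetical spectral counts per protein.
unfractionated = {"protA": 500, "protB": 200, "protC": 150, "protD": 100, "protE": 50}
depleted = {"protC": 450, "protD": 400, "protE": 200}

share = top_n_share(unfractionated, 2)
enrichment = fold_enrichment(depleted, unfractionated, "protE")
print(share, enrichment)
```

A share near 0.9 for the top 50 proteins, as reported above, means that only about a tenth of the acquired spectra remain for every other protein in the sample.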
The primary utility of trypsin digestion in proteomics is that it cleaves proteins at predictable locations, but it is also notable for yielding peptides that terminate in basic arginine and lysine residues. Tryptic peptides fragment in ion trap tandem mass spectrometry to produce prominent C-terminal y series ions. Alternative proteolytic digests may produce peptides that do not follow these rules. In this study, we examine 2568 peptides generated through proteinase K digestion, a technique that produces a greater diversity of basic residue content in peptides. We show that the position of basic residues within peptides influences the peak intensities of b and y series ions; a basic residue near the N-terminus of a peptide can lead to prominent b series peaks rather than the intense y series peaks associated with tryptic peptides. The effects of presence and position for arginine, lysine, and histidine are explored separately and in combination. Arg shows the most dominant effects followed by His and then by Lys. Fragment ions containing basic residues produce more intense peaks than those without basic residues. Doubly charged precursor ions have generally been modeled as producing only singly charged fragment ions, but fragment ions that contain two basic residues may accept both protons during fragmentation. By characterizing the influence of basic residues on gas-phase fragmentation of peptides, this research makes possible more accurate fragmentation models for peptide identification algorithms.
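As a concrete companion to the discussion of b and y series ions, the sketch below computes fragment m/z values for a peptide and counts the basic residues (Arg, Lys, His) in each fragment. It uses standard monoisotopic residue masses; the function names are hypothetical, and the code does not model peak intensities, which is the subject of the study itself.

```python
# Standard monoisotopic residue masses (Da).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276
WATER = 18.010565
BASIC = set("RKH")

def count_basic(sequence):
    """Number of basic residues (R, K, H) in a fragment sequence."""
    return sum(aa in BASIC for aa in sequence)

def fragment_ions(peptide, charge=1):
    """Return (label, m/z, basic residue count) for each b and y fragment."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    ions = []
    for i in range(1, len(peptide)):
        b_mz = (sum(masses[:i]) + charge * PROTON) / charge
        y_mz = (sum(masses[-i:]) + WATER + charge * PROTON) / charge
        ions.append((f"b{i}", b_mz, count_basic(peptide[:i])))
        ions.append((f"y{i}", y_mz, count_basic(peptide[-i:])))
    return ions

for label, mz, n_basic in fragment_ions("RGAK"):
    print(f"{label}: m/z {mz:.4f}, basic residues: {n_basic}")
```

In the hypothetical peptide RGAK, every b fragment contains the N-terminal arginine, which is the situation in which the abstract reports prominent b series peaks. Fragments with two basic residues are the candidates for doubly charged fragment ions, which can be listed by calling `fragment_ions` with `charge=2`.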
LC-MS/MS and automated protein library searching provide a high-throughput strategy for peptide sequence assignments for identification of qualitative differences in comparative shotgun proteomics. However, the rapid increase in size and depth of analysis requires advancement in the speed, scale and flexibility of tools available for parsimony-level data analysis. MassSieve has been developed as a platform for parsimony analysis of large-scale MS/MS experiments, both for single and comparative analysis.
MassSieve supports reports from multiple search engines with differing probability-based search characteristics, which can increase peptide sequence coverage and/or identify conflicting or ambiguous spectral assignments. Label-free relative quantitative information is also available by spectral hit counts per peptide and per protein. Graphical display of each set of related peptides and proteins, as well as user-defined treatment of indeterminate peptides, provides for visualization of possible isoforms or conflicting database entries.
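Parsimony analysis seeks a small set of proteins that accounts for all identified peptides. The sketch below uses a greedy set-cover heuristic to illustrate the idea; it is an assumption for illustration and not a description of the algorithm MassSieve actually implements. The protein and peptide labels are hypothetical.

```python
def parsimonious_proteins(protein_to_peptides):
    """Greedy set cover: repeatedly pick the protein explaining the most
    still-unexplained peptides (ties broken by protein name)."""
    uncovered = set().union(*protein_to_peptides.values())
    chosen = []
    while uncovered:
        best = max(sorted(protein_to_peptides),
                   key=lambda p: len(protein_to_peptides[p] & uncovered))
        chosen.append(best)
        uncovered -= protein_to_peptides[best]
    return chosen

# Hypothetical mapping of candidate proteins to their identified peptides.
candidates = {
    "P1": {"pepA", "pepB", "pepC"},
    "P2": {"pepB", "pepC"},          # subset of P1, adds no new peptides
    "P3": {"pepD"},
    "P4": {"pepC", "pepD"},          # indistinguishable from P3 for pepD
}
minimal = parsimonious_proteins(candidates)
print(minimal)
```

Here P2 is dropped because P1 already explains its peptides, while P3 and P4 are tied for pepD; such indeterminate cases are the ones that benefit from user-defined treatment.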
Due to the possibility of a biothreat attack on civilian or military installations, a need exists for technologies that can detect and accurately identify pathogens in near real time. One technology potentially capable of meeting these needs is a high-throughput mass spectrometry (MS)-based proteomic approach. This approach utilizes the knowledge of amino acid sequences of peptides derived from the proteolysis of proteins as a basis for reliable bacterial identification. To evaluate this approach, the tryptic digest peptides generated from double-blind biological samples containing either a single bacterium or a mixture of bacteria were analyzed using liquid chromatography-tandem mass spectrometry. Bioinformatic tools that provide bacterial classification were used to evaluate the proteomic approach. Results showed that bacteria in all of the double-blind samples were accurately identified with no false-positive assignment. The MS proteomic approach showed strain-level discrimination for the various bacteria employed. The approach also characterized double-blind bacterial samples to the respective genus, species, and strain levels when the experimental organism was not in the database due to its genome not having been sequenced. The organism in one experimental sample did not have its genome sequenced, and the peptide experimental record was added to the virtual bacterial proteome database. A replicate analysis identified the sample to the peptide experimental record stored in the database. The MS proteomic approach proved capable of identifying and classifying organisms within a microbial mixture.
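One simple way to turn identified peptide sequences into a bacterial classification is to count peptides that occur in exactly one organism's proteome. The sketch below illustrates that idea only; the strain labels, peptide strings, and function name are hypothetical, and the bioinformatic tools used in the study may work quite differently.

```python
from collections import Counter

def unique_peptide_hits(identified_peptides, proteome_db):
    """Count, per organism, identified peptides found in that organism only."""
    owners = {}
    for organism, peptides in proteome_db.items():
        for pep in peptides:
            owners.setdefault(pep, set()).add(organism)
    hits = Counter()
    for pep in identified_peptides:
        organisms = owners.get(pep, set())
        if len(organisms) == 1:  # peptide discriminates a single organism
            hits[next(iter(organisms))] += 1
    return hits

# Hypothetical in-silico tryptic peptides for two strains.
proteome_db = {
    "strain_A": {"LVTDAK", "GESHAR", "SHAREDPEPK"},
    "strain_B": {"MNQTWR", "SHAREDPEPK"},
}
observed = ["LVTDAK", "GESHAR", "SHAREDPEPK", "UNKNOWNPEPK"]
hits = unique_peptide_hits(observed, proteome_db)
print(hits.most_common())
```

Peptides shared between strains or absent from the database contribute nothing, so a mixture shows up as unique hits for more than one organism.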
Recently there has been an increasing interest in using spectral searching as an alternative to traditional database sequence searching methods for peptide identification from tandem mass spectrometry. In spectral searching, the query spectrum is compared to a carefully compiled library of previously observed and identified spectra; high spectral similarity signals positive identification. We have previously developed an open-source software toolkit, SpectraST, to enable proteomics researchers to integrate spectral searching into their data analysis pipeline. Here we report an additional module to SpectraST that provides the functionality of spectral library building, allowing users to build custom libraries when public spectral libraries do not adequately meet their needs. A consensus creation algorithm was developed to coalesce replicate spectra identified to the same peptide ion. Various quality filters were implemented to remove questionable and low-quality spectra from the library. To validate the methodology, we first compiled a spectral library from the 1.3 million SEQUEST-identified spectra (29,109 distinct peptide ions) among the publicly released datasets in the Human Plasma PeptideAtlas, a collection of 40 contributed, heterogeneous shotgun proteomics datasets, and verified the effectiveness of the library building algorithm to generate high-quality, representative consensus spectra and to remove questionable spectra. We then re-searched the same datasets by SpectraST against this spectral library filtered at different quality levels, and used the performance as a benchmark to evaluate our library building methods and to determine key parameters for high-quality library building. We demonstrated the importance of library quality on the performance of spectral searching. 
The ready-to-deploy software allows individual researchers to easily condense their raw data into specialized spectral libraries, summarizing useful information about their observed proteomes into a concise and retrievable format for future data analyses.
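Spectral searching relies on a similarity score between a query spectrum and library spectra. A common choice is the normalized dot product of binned peak intensities, sketched below. The bin width, function names, and peak lists are assumptions for illustration; the sketch does not reproduce SpectraST's actual scoring or its consensus-building algorithm.

```python
import math

def bin_spectrum(peaks, bin_width=1.0):
    """Sum peak intensities into m/z bins of fixed width."""
    binned = {}
    for mz, intensity in peaks:
        key = int(mz / bin_width)
        binned[key] = binned.get(key, 0.0) + intensity
    return binned

def dot_product(a, b):
    """Normalized dot product (cosine similarity) of two binned spectra."""
    numerator = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return numerator / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical (m/z, intensity) peak lists.
library = bin_spectrum([(147.1, 100.0), (204.1, 40.0), (361.2, 80.0)])
query = bin_spectrum([(147.2, 90.0), (204.3, 50.0), (361.1, 70.0)])
unrelated = bin_spectrum([(500.4, 10.0), (612.3, 30.0)])

print(dot_product(query, library), dot_product(unrelated, library))
```

A score near 1 indicates a close match, and a score of 0 means no shared bins. Noisy or misassigned library spectra lower the scores of true matches, which is one way to see why library quality matters for search performance.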
The use of multidimensional capillary HPLC combined with tandem mass spectrometry has allowed high qualitative and quantitative proteome coverage of prokaryotic organisms. The determination of protein abundance change between two or more conditions has matured to the point that false discovery rates can be very low and, for smaller proteomes, coverage is sufficiently high to explicitly consider false negative error. Selected aspects of using these methods for global protein abundance assessments are reviewed. These include instrumental issues that influence the reliability of abundance ratios; a comparison of sources of non-linearity, errors, and data compression in proteomics and spotted cDNA arrays; strengths and weaknesses of spectral counting versus stable isotope metabolic labeling; and a survey of microbiological applications of global abundance analysis at the protein level. Proteomic results for two organisms that have been studied extensively using these methods are reviewed in greater detail. Spectral counting and metabolic labeling data are compared and the utility of proteomics for global gene regulation studies is discussed for the methanogenic Archaeon Methanococcus maripaludis. The oral pathogen Porphyromonas gingivalis is discussed as an example of an organism where a large percentage of the proteome differs in relative abundance between the intracellular and extracellular phenotype.
Differential protein abundance; Tandem mass spectra; Quantitative analysis; Multidimensional liquid chromatography; Prokaryote; False negative; False positive
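An abundance ratio from spectral counting can be estimated by normalizing each protein's count to the total counts of its run and taking a log2 ratio, as in the sketch below. The pseudocount that guards against zero counts, the function name, and the numbers are assumptions for illustration, not a method prescribed by the review.

```python
import math

def log2_ratio(count_a, count_b, total_a, total_b, pseudocount=0.5):
    """log2 abundance ratio (condition A over B) from normalized spectral counts."""
    norm_a = (count_a + pseudocount) / total_a
    norm_b = (count_b + pseudocount) / total_b
    return math.log2(norm_a / norm_b)

# Hypothetical counts: 40 spectra in condition A, 10 in B, equal run totals.
ratio = log2_ratio(40, 10, total_a=1000, total_b=1000, pseudocount=0.0)
print(ratio)

# A protein seen in only one condition still gets a finite ratio.
print(log2_ratio(12, 0, total_a=1000, total_b=1000))
```

Low counts make such ratios unstable, which is one of the weaknesses of spectral counting relative to stable isotope metabolic labeling that the review compares.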