Summary: YADA can deisotope and decharge high-resolution mass spectra from large peptide molecules, link the precursor monoisotopic peak information to the corresponding tandem mass spectrum, and account for different co-fragmenting ion species (multiplexed spectra). We describe how YADA enables a pipeline consisting of ProLuCID and DTASelect for analyzing large-scale middle-down proteomics data.
Supplementary information: Supplementary data are available at Bioinformatics online.
Stable isotope tracing with ultra-high resolution Fourier transform-ion cyclotron resonance-mass spectrometry (FT-ICR-MS) can provide simultaneous determination of hundreds to thousands of metabolite isotopologue species without the need for chromatographic separation. Therefore, this experimental metabolomics methodology may allow the tracing of metabolic pathways starting from stable-isotope-enriched precursors, which can improve our mechanistic understanding of cellular metabolism. However, contributions to the observed intensities arising from the stable isotope's natural abundance must be subtracted (deisotoped) from the raw isotopologue peaks before interpretation. Previously posed deisotoping problems are sidestepped due to the isotopic resolution and identification of individual isotopologue peaks. This peak resolution and identification come from the very high mass resolution and accuracy of FT-ICR-MS and present an analytically solvable deisotoping problem, even in the context of stable-isotope enrichment.
We present both a computationally feasible analytical solution and an algorithm for this newly posed deisotoping problem, both of which work with any amount of 13C or 15N stable-isotope enrichment. We demonstrate this algorithm and correct for the effects of 13C natural abundance on a set of raw isotopologue intensities for a specific phosphatidylcholine lipid metabolite derived from a 13C-tracing experiment.
Correction for the effects of 13C natural abundance on a set of raw isotopologue intensities is computationally feasible when the raw isotopologues are isotopically resolved and identified. Such correction makes qualitative interpretation of stable isotope tracing easier and is required before attempting a more rigorous quantitative interpretation of the isotopologue data. The presented implementation is very robust with increasing metabolite size. Error analysis of the algorithm will be straightforward due to low relative error from the implementation itself. Furthermore, the algorithm may serve as an independent quality control measure for a set of observed isotopologue intensities.
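To make the correction concrete, here is a minimal sketch of a 13C natural-abundance correction under a simple binomial model. It is not the authors' analytical solution or algorithm. It assumes that every isotopologue from 0 to C labeled carbons is resolved and observed, that only carbon contributes, and that the natural abundance of 13C is about 1.07%. All function names are hypothetical.

```python
from math import comb

P13C = 0.0107  # assumed natural abundance of 13C


def natural_abundance_matrix(n_carbons, p=P13C):
    """A[n][k]: probability that a molecule carrying k tracer-derived 13C
    atoms is observed with n total 13C atoms (binomial over the other carbons)."""
    size = n_carbons + 1
    matrix = [[0.0] * size for _ in range(size)]
    for k in range(size):
        free = n_carbons - k  # carbons not labeled by the tracer
        for n in range(k, size):
            extra = n - k  # 13C atoms contributed by natural abundance
            matrix[n][k] = comb(free, extra) * p**extra * (1 - p) ** (free - extra)
    return matrix


def apply_natural_abundance(true_intensities, p=P13C):
    """Forward model: spread tracer-only intensities over observed isotopologues."""
    n_c = len(true_intensities) - 1
    a = natural_abundance_matrix(n_c, p)
    return [sum(a[n][k] * true_intensities[k] for k in range(n + 1))
            for n in range(n_c + 1)]


def correct_natural_abundance(observed, p=P13C):
    """Invert the lower-triangular model by forward substitution."""
    n_c = len(observed) - 1
    a = natural_abundance_matrix(n_c, p)
    corrected = []
    for n in range(n_c + 1):
        spill = sum(a[n][k] * corrected[k] for k in range(n))
        corrected.append((observed[n] - spill) / a[n][n])
    return corrected
```

Because the model matrix is lower triangular, each corrected intensity depends only on the isotopologues with fewer 13C atoms, so the correction can be computed one isotopologue at a time.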
Protonated molecular peptide ions and their product ions generated by tandem mass spectrometry appear as isotopologue clusters due to the natural isotopic variations of carbon, hydrogen, nitrogen, oxygen and sulfur. Quantitation of the isotopic composition of peptides can be employed in experiments involving isotope effects, isotope exchange, isotopic labeling by chemical reactions, and studies of metabolism by stable isotope incorporation. Both ion trap and quadrupole-time of flight mass spectrometry are shown to be capable of determining the isotopic composition of peptide product ions obtained by tandem mass spectrometry with both precision and accuracy. Tandem mass spectra obtained in profile-mode of clusters of isotopologue ions are fit by non-linear least squares to a series of Gaussian peaks (described in the accompanying manuscript) that quantify the Mn/M0 values defining the isotopologue distribution (ID). To determine the isotopic composition of product ions from their ID, a new algorithm that predicts the Mn/M0 ratios is developed, which obviates the need to determine the intensity of all of the ions of an ID. Consequently, a precise and accurate determination of the isotopic composition of a product ion may be obtained from only the initial values of the ID; however, the entire isotopologue cluster must be isolated prior to fragmentation. Following optimization of the molecular ion isolation width, fragmentation energy and detector sensitivity, the presence of isotopic excess (2H, 13C, 15N, 18O) is readily determined within 1%. The ability to determine the isotopic composition of sequential product ions permits the isotopic composition of individual amino acid residues in the precursor ion to be determined.
isotopologue distribution; mass isotopomer distribution; tandem mass spectrometry; deuterium incorporation; isotopic excess; isotope quantitation; H/D exchange; protein turnover
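As an illustration of how Mn/M0 ratios follow from an elemental composition, the sketch below predicts them at natural abundance by repeated convolution of per-element isotope patterns. This is a generic textbook calculation, not the algorithm developed in the paper. The abundance values are approximate, the patterns are indexed by nominal mass shift, and the names are hypothetical.

```python
# Approximate natural isotope abundances, indexed by nominal mass shift (assumed values).
ABUNDANCE = {
    "C": [0.9893, 0.0107],
    "H": [0.999885, 0.000115],
    "N": [0.99636, 0.00364],
    "O": [0.99757, 0.00038, 0.00205],
    "S": [0.9499, 0.0075, 0.0425, 0.0, 0.0001],
}


def convolve(a, b, max_len):
    """Polynomial product of two isotope patterns, truncated to max_len terms."""
    out = [0.0] * min(len(a) + len(b) - 1, max_len)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            if i + j < len(out):
                out[i + j] += x * y
    return out


def isotopologue_ratios(formula, n_peaks=4):
    """Return [M0/M0, M1/M0, ...] for a formula such as {"C": 5, "H": 9, "N": 1}."""
    dist = [1.0]
    for element, count in formula.items():
        for _ in range(count):
            dist = convolve(dist, ABUNDANCE[element], n_peaks)
    return [d / dist[0] for d in dist]
```

Truncating after each convolution does not change the retained terms, because a peak at mass shift n only receives contributions from shifts no larger than n.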
We developed the LipidQA (Lipid Qualitative/Quantitative Analysis) software platform to identify and quantitate complex lipid molecular species in biological mixtures. LipidQA can process raw electronic data files from the TSQ-7000 triple stage quadrupole and LTQ linear ion trap mass spectrometers from Thermo-Finnigan and the Q-TOF hybrid quadrupole/time-of-flight instrument from Waters-Micromass and could readily be modified to accommodate data from others. The program processes multiple spectra in a few seconds and includes a deisotoping algorithm that increases the accuracy of structural identification and quantitation. Identification is achieved by comparing MS2 spectra obtained in a data-dependent manner to a library of reference spectra of complex lipids that we have acquired or constructed from established fragmentation rules. The current form of the algorithm can process data acquired in negative or positive ion mode for glycerophospholipid species of all major head-group classes.
glycerophospholipid; electrospray ionization; computer algorithm; quantitation; identification; database searching
Tandem mass spectrometry (MS/MS) is frequently used in the identification of peptides and proteins. Typical proteomic experiments rely on algorithms such as SEQUEST and MASCOT to compare thousands of tandem mass spectra against the theoretical fragment ion spectra of peptides in a database. The probabilities that these spectrum-to-sequence assignments are correct can be determined by statistical software such as PeptideProphet or through estimations based on reverse or decoy databases. However, many of the software applications that assign probabilities for MS/MS spectra to sequence matches were developed using training datasets from 3D ion-trap mass spectrometers. Given the variety of types of mass spectrometers that have become commercially available over the last five years, we sought to generate a dataset of reference data covering multiple instrumentation platforms to facilitate both the refinement of existing computational approaches and the development of novel software tools. We analyzed the proteolytic peptides in a mixture of tryptic digests of 18 proteins, named the “ISB standard protein mix”, using 8 different mass spectrometers. These include linear and 3D ion traps, two quadrupole time-of-flight platforms (qq-TOF) and two MALDI-TOF-TOF platforms. The resulting dataset, which has been named the Standard Protein Mix Database, consists of over 1.1 million spectra in 150+ replicate runs on the mass spectrometers. The data were inspected for quality of separation and searched using SEQUEST. All data, including the native raw instrument and mzXML formats and the PeptideProphet validated peptide assignments, are available at http://regis-web.systemsbiology.net/PublicDatasets/.
Proteomics; reference dataset; database search software; standard protein mix; Standard Protein Mix Database
Peptide sequence identification using tandem mass spectroscopy remains a major challenge for complex proteomic studies. Peptide matching algorithms require the accurate determination of both the mass and charge of the precursor ion and accommodate uncertainties in these properties by using a wide precursor mass tolerance and by testing, for each spectrum, several possible candidate charges. Using a data acquisition strategy that includes obtaining narrow mass-range MS1 “zoom” scans, we describe here a post-acquisition algorithm dubbed MAZIE, that accurately determines the charge and monoisotopic mass of precursor ions on a low-resolution Thermo LTQ-XL mass spectrometer. This is achieved by examining the isotopic distribution obtained in the preceding MS1 zoom spectrum and comparing to theoretical distributions for candidate charge states from +1 to +4. MAZIE then writes modified data files with the corrected monoisotopic mass and charge. We have validated MAZIE results by comparing the sequence search results obtained with the MAZIE-generated data files to results using the unmodified data files. Using two different search algorithms and a false discovery rate filter, we found that MAZIE-interpreted data resulted in 80% (using SEQUEST) and 30% (using OMSSA) more high-confidence sequence identifications. Analyses of these results indicate that the accurate determination of the precursor ion mass greatly facilitates the ability to differentiate between true and false positive matches, while the determination of the precursor ion charge reduces the overall search time but does not significantly reduce the ambiguity of interpreting the search results. MAZIE is distributed as an open-source PERL script.
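The idea of reading the charge state from an isotopic cluster can be sketched in a few lines. The code below is a simplified stand-in, not MAZIE itself: it uses only the spacing between adjacent isotopic peaks (about 1.00335/z in m/z) rather than a comparison against theoretical isotopic distributions, and it assumes the lowest-m/z peak in the cluster is the monoisotopic peak. The function name, tolerance and constants are assumptions.

```python
ISOTOPE_SPACING = 1.00335  # approximate 13C - 12C mass difference, Da
PROTON_MASS = 1.00728      # approximate proton mass, Da


def infer_charge_and_mass(peak_mzs, max_charge=4, tol=0.05):
    """Pick the charge whose expected isotope spacing matches the most
    adjacent peak gaps; return (charge, neutral monoisotopic mass) or None."""
    mzs = sorted(peak_mzs)
    gaps = [b - a for a, b in zip(mzs, mzs[1:])]
    best_z, best_hits = None, 0
    for z in range(1, max_charge + 1):
        expected = ISOTOPE_SPACING / z
        hits = sum(1 for g in gaps if abs(g - expected) <= tol)
        if hits > best_hits:  # ties keep the lower charge
            best_z, best_hits = z, hits
    if best_z is None:
        return None
    # Assumption: the first peak of the cluster is the monoisotopic one.
    mono_mass = (mzs[0] - PROTON_MASS) * best_z
    return best_z, mono_mass
```

A spacing-only rule becomes ambiguous for high charge states at low resolution, which is one reason a comparison against full theoretical distributions, as described above, is the more robust choice.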
We describe a method for assessing the quality of mass spectra and improving reliability of relative ratio estimations from 18O-water labeling experiments acquired from low resolution mass spectrometers. The mass profiles of heavy and light peptide pairs are often affected by artifacts, including co-eluting contaminant species, noise signal, instrumental fluctuations in measuring ion position and abundance levels. Such artifacts distort the profiles, leading to erroneous ratio estimations thus reducing the reliability of ratio estimations in high throughput quantification experiments.
We used support vector machines (SVMs) to filter out mass spectra that deviated significantly from expected theoretical isotope distributions. We built an SVM classifier with a decision function which assigns a score to every mass profile based on such spectral features as mass accuracy, signal-to-noise ratio, and differences between experimental and theoretical isotopic distributions.
The classifier was trained using a dataset obtained from samples of mouse renal cortex. We then tested it on protein samples (bovine serum albumin), mixed in five different ratios of labeled and unlabeled species. We demonstrated that filtering the data using our SVM classifier results in as much as a nine-fold reduction in the coefficient of variance of peptide ratios, thus significantly improving the reliability of ratio estimations.
support vector machines; stable-isotope labeling; signal-to-noise ratio; isotope distribution; mass accuracy
Although tandem mass spectrometry (MS/MS) has become an integral part of proteomics, intensity patterns in MS/MS spectra are rarely weighted heavily in most widely used algorithms because they are not yet fully understood. Here a knowledge mining approach is demonstrated to discover fragmentation intensity patterns and elucidate the chemical factors behind such patterns. Fragmentation intensity information from 28 330 ion trap peptide MS/MS spectra of different charge states and sequences went through unsupervised clustering using a penalized K-means algorithm. Without any prior chemistry assumptions, four clusters with distinctive fragmentation patterns were obtained. A decision tree was generated to investigate peptide sequence motif and charge state status that caused these fragmentation patterns. This data-mining scheme is generally applicable for any large data sets. It bypasses the common prior knowledge constraints and reports on the overall peptide fragmentation behavior. It improves the understanding of gas-phase peptide dissociation and provides a foundation for new or improved protein identification algorithms.
data mining; cluster analysis; K-means algorithm; penalized K-means algorithm; CART; statistical analysis; quantile map; peptide; MS/MS; intensity; ion trap; CID; fragmentation pattern; dissociation pattern; pairwise cleavage; Xxx-Zzz; cleavage pair; Fisher information; Pro; Gly; Asp; Glu; Arg; Lys
High-resolution tandem mass spectra can now be readily acquired with hybrid instruments, such as LTQ-Orbitrap and LTQ-FT, in high-throughput shotgun proteomics workflows. The improved spectral quality enables more accurate de novo sequencing for identification of post-translational modifications and amino acid polymorphisms.
In this study, a new de novo sequencing algorithm, called Vonode, has been developed specifically for analysis of such high-resolution tandem mass spectra. To fully exploit the high mass accuracy of these spectra, a unique scoring system is proposed to evaluate sequence tags based primarily on mass accuracy information of fragment ions. Consensus sequence tags were inferred for 11,422 spectra with an average peptide length of 5.5 residues from a total of 40,297 input spectra acquired in a 24-hour proteomics measurement of Rhodopseudomonas palustris. The accuracy of inferred consensus sequence tags was 84%. According to our comparison, the performance of Vonode was shown to be superior to the PepNovo v2.0 algorithm, in terms of the number of de novo sequenced spectra and the sequencing accuracy.
Here, we improved de novo sequencing performance by developing a new algorithm specifically for high-resolution tandem mass spectral data. The Vonode algorithm is freely available for download at http://compbio.ornl.gov/Vonode.
Collision-induced dissociation (CID) is a common ion activation technique used to energize mass-selected peptide ions during tandem mass spectrometry. Characteristic fragment ions form from the cleavage of amide bonds within a peptide undergoing CID, allowing the inference of its amino acid sequence. The statistical characterization of these fragment ions is essential for improving peptide identification algorithms and for understanding the complex reactions taking place during CID. An examination of 1465 ion trap spectra from doubly charged tryptic peptides reveals several trends important to understanding this fragmentation process. While less abundant than y ions, b ions are present in sufficient numbers to aid sequencing algorithms. Fragment ions exhibit a characteristic series-specific relationship between their masses and intensities. Each residue influences fragmentation at adjacent amide bonds, with Pro quantifiably enhancing cleavage at its N-terminal amide bond and His increasing the formation of b ions at its C-terminal amide bond. Fragment ions corresponding to a formal loss of ammonia appear preferentially in peptides containing Gln and Asn. These trends are partially responsible for the complexity of peptide tandem mass spectra.
Tandem mass spectrometry has become particularly useful for the rapid identification and characterization of protein components of complex biological mixtures. Powerful database search methods, such as SEQUEST and MASCOT, have been developed for peptide identification; they compare the mass spectra obtained from unknown proteins or peptides with theoretically predicted spectra derived from protein databases. However, the majority of spectra generated in a mass spectrometry experiment are of too poor quality to be interpreted, while some high-quality spectra cannot be interpreted by one method but may be interpreted by others. Hence, a filtering algorithm that removes poor-quality spectra prior to the database search is appealing.
This paper proposes a support vector machine (SVM) based approach to assess the quality of tandem mass spectra. Each mass spectrum is mapped into the 16 proposed features to describe its quality. Based on the results from SEQUEST, four SVM classifiers with the 16 features as input are trained and tested on ISB data and TOV data, respectively. The superior performance of the proposed SVM classifiers is illustrated both by comparison with existing classifiers and by validation in terms of MASCOT search results.
The proposed method can be employed to effectively remove poor-quality spectra before spectral searching, and also to find more peptides or post-translationally modified peptides from high-quality spectra using different search engines or de novo methods.
Currently, the tandem mass spectrometry (MSMS) of peptides is a dominant technique used to identify peptides and consequently proteins. The peptide fragmentation inside the mass analyzer typically offers a spectrum containing several different groups of ions. The mass to charge (m/z) values of these ions can be exactly calculated following simple rules based on the possible peptide fragmentation reactions. But the (relative) intensities of the particular ions cannot be simply predicted from the amino-acid sequence of the peptide. This study presents initial work towards developing a theoretical fundamental approach to ion intensity elucidation by utilizing quantum mechanical computations.
MSMS spectra of the doubly charged GAVLK peptide were collected on electrospray ion trap mass spectrometers using low energy modes of fragmentation. Density functional theory (DFT) calculations were performed on the population of ion precursors to determine the fragment ion intensities corresponding to a Boltzmann distribution of the protonation of nitrogens in the peptide backbone amide bonds.
We were able to a) predict the intensity order of the y and b ions in agreement with experimental observation; and b) predict the relative intensities of y ions with errors not exceeding the experimental variation.
These results suggest that the GAVLK peptide fragmentation process in the ion trap mass spectrometer is predominantly driven by the thermodynamic stability of the precursor ions formed upon ionization of the sample. The computational approach presented in this manuscript successfully calculated ion intensities in the mass spectra of this doubly charged tryptic peptide, based solely on its amino acid sequence. As such, this work indicates a potential of incorporating quantum mechanical calculations into mass spectrometry based algorithms for molecular identification.
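The Boltzmann-population step behind this kind of prediction can be illustrated with a short sketch. It covers only the conversion of relative energies into populations; how the paper maps protonation-site populations onto fragment ion intensities is not reproduced here. The temperature default and the function name are assumptions, and any energies passed in would be hypothetical inputs rather than values from the study.

```python
from math import exp

GAS_CONSTANT_KCAL = 0.0019872041  # kcal / (mol K)


def boltzmann_populations(relative_energies_kcal, temperature_k=298.15):
    """Fractional populations of states from their relative energies (kcal/mol)."""
    e_min = min(relative_energies_kcal)  # shift for numerical stability
    rt = GAS_CONSTANT_KCAL * temperature_k
    weights = [exp(-(e - e_min) / rt) for e in relative_energies_kcal]
    total = sum(weights)
    return [w / total for w in weights]
```

For example, three protonation sites with relative energies of 0, 1 and 2 kcal/mol would be populated in strictly decreasing order, with the fractions summing to one.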
Shotgun proteomics experiments are dependent upon database search engines to identify peptides from tandem mass spectra. Many of these algorithms score potential identifications by evaluating the number of fragment ions matched between each peptide sequence and an observed spectrum. These systems, however, generally do not distinguish between matching an intense peak and matching a minor peak. We have developed a statistical model to score peptide matches that is based upon the multivariate hypergeometric distribution. This scorer, part of the “MyriMatch” database search engine, places greater emphasis on matching intense peaks. The probability that the best match for each spectrum has occurred by random chance can be employed to separate correct matches from random ones. We evaluated this software on data sets from three different laboratories employing three different ion trap instruments. Employing a novel system for testing discrimination, we demonstrate that stratifying peaks into multiple intensity classes improves the discrimination of scoring. We compare MyriMatch results to those of Sequest and X!Tandem, revealing that it is capable of higher discrimination than either of these algorithms. When minimal peak filtering is employed, performance plummets for a scoring model that does not stratify matched peaks by intensity. On the other hand, we find that MyriMatch discrimination improves as more peaks are retained in each spectrum. MyriMatch also scales well to tandem mass spectra from high-resolution mass analyzers. These findings may indicate limitations for existing database search scorers that count matched peaks without differentiating them by intensity. This software and source code is available under Mozilla Public License at this URL: http://www.mc.vanderbilt.edu/msrc/bioinformatics/.
Proteomics; Identification; Statistical Distribution; Reversed Database; Peak Filtering
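The multivariate hypergeometric probability at the heart of such a scorer can be written down directly. The sketch below is not the MyriMatch implementation. It assumes, for illustration, that the m/z range is divided into bins assigned to intensity classes (plus a class of empty bins), and that each predicted fragment is treated as one draw without replacement. The function name and the example class sizes are hypothetical.

```python
from math import comb


def mvh_probability(class_sizes, matched_per_class):
    """Probability of drawing exactly matched_per_class[i] items from each
    class of size class_sizes[i], without replacement."""
    n_total = sum(class_sizes)
    n_drawn = sum(matched_per_class)
    numerator = 1
    for size, matched in zip(class_sizes, matched_per_class):
        numerator *= comb(size, matched)  # comb returns 0 if matched > size
    return numerator / comb(n_total, n_drawn)


# Hypothetical spectrum: 5 intense, 10 medium, 20 weak bins and 165 empty bins.
classes = [5, 10, 20, 165]
p_intense = mvh_probability(classes, [3, 0, 0, 7])  # 3 of 10 fragments hit intense peaks
p_weak = mvh_probability(classes, [0, 0, 3, 7])     # 3 of 10 fragments hit weak peaks
```

Under these assumptions, matching three intense peaks is far less likely to happen by chance than matching three weak peaks, which is how stratifying by intensity rewards matches to intense peaks.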
Mass spectrometers can produce a large number of tandem mass spectra. Unfortunately, these spectra are contaminated by noise. Noise can degrade the quality of tandem mass spectra and thus increase the false positives and false negatives in peptide identification. Therefore, it is appealing to develop an approach to denoising tandem mass spectra.
We propose a novel approach to denoising tandem mass spectra. The proposed approach consists of two modules: spectral peak intensity adjustment and intensity local maximum extraction. In the spectral peak intensity adjustment module, we introduce five features to describe the quality of each peak. Based on these features, a score is calculated for each peak and is used to adjust its intensity. As a result, the intensity will be adjusted to a local maximum if a peak is a signal peak, and it will be decreased if the peak is a noisy one. The second module uses a morphological reconstruction filter to remove the peaks whose intensities are not the local maxima of the spectrum. Experiments have been conducted on two ion trap tandem mass spectral datasets: ISB and TOV. Experimental results show that our algorithm can remove about 69% of the peaks of a spectrum. At the same time, the number of spectra that can be identified by Mascot algorithm increases by 31.23% and 14.12% for the two tandem mass spectra datasets, respectively.
The proposed denoising algorithm can be integrated into current popular peptide identification algorithms such as Mascot to improve the reliability of assigning peptides to spectra.
Availability of the software
The software created from this work is available upon request.
Protein quantification is an essential step in many proteomics experiments. A number of labeling approaches have been proposed and adopted in mass spectrometry (MS) based relative quantification. The mTRAQ, one of the stable isotope labeling methods, is amine-specific and available in triplex format, so that the sample throughput could be doubled when compared with duplex reagents.
Methods and results
Here we propose a novel data analysis algorithm for peptide quantification in triplex mTRAQ experiments. It improves the accuracy of quantification in two ways. First, it identifies and separates the triplex isotopic clusters of a peptide in each full MS scan. We designed a schematic model of triplex overlapping isotopic clusters, and separated the clusters by solving cubic equations deduced from the schematic model. Second, it automatically determines the elution areas of peptides. Some peptides have similar atomic masses and elution times, so their elution areas can overlap. Our algorithm successfully identified the overlaps and found accurate elution areas. We validated our algorithm using standard protein mixture experiments.
We showed that our algorithm was able to accurately quantify peptides in triplex mTRAQ experiments. Its software implementation is compatible with Trans-Proteomic Pipeline (TPP), and thus enables high-throughput analysis of proteomics data.
Protein-amide proton hydrogen-deuterium exchange (HDX) is used to investigate protein conformation, conformational changes and surface binding sites for other molecules. To our knowledge, software tools to automate data processing and analysis from sample fractionating (LC-MALDI) mass-spectrometry-based HDX workflows are not publicly available.
An integrated data pipeline (Solvent Explorer/TOF2H) has been developed for the processing of LC-MALDI-derived HDX data. Based on an experiment-wide template, and taking an ab initio approach to chromatographic and spectral peak finding, initial data processing is based on accurate mass-matching to fully deisotoped peaklists accommodating, in MS/MS-confirmed peptide library searches, ambiguous mass-hits to non-target proteins. Isotope-shift re-interrogation of library search results allows quick assessment of the extent of deuteration from peaklist data alone. During raw spectrum editing, each spectral segment is validated in real time, consistent with the manageable spectral numbers resulting from LC-MALDI experiments. A semi-automated spectral-segment editor includes a semi-automated or automated assessment of the quality of all spectral segments as they are pooled across an XIC peak for summing, centroid mass determination, building of rates plots on-the-fly, and automated back exchange correction. The resulting deuterium uptake rates plots from various experiments can be averaged, subtracted, re-scaled, error-barred, and/or scatter-plotted from individual spectral segment centroids, compared to solvent exposure and hydrogen bonding predictions and receive a color suggestion for 3D visualization. This software lends itself to a "divorced" HDX approach in which MS/MS-confirmed peptide libraries are built via nano or standard ESI without source modification, and HDX is performed via LC-MALDI using a standard MALDI-TOF. The complete TOF2H package includes additional (eg LC analysis) modules.
"TOF2H" provides a comprehensive HDX data analysis package that has accelerated the processing of LC-MALDI-based HDX data in the authors' lab from weeks to hours. It runs in a standard MS Windows (XP or Vista) environment, and can be downloaded or obtained from the authors at no cost.
Tandem mass spectra (MS/MS) produced using electron transfer dissociation (ETD) differ from those derived from collision-activated dissociation (CAD) in several important ways. Foremost, the predominant fragment ion series are different: c- and z•-type ions are favored in ETD spectra while b- and y-type ions comprise the bulk of the CAD spectra. Additionally, ETD spectra possess specific neutral losses and charge-reduced precursors. Most database search algorithms were designed to analyze CAD spectra, and have only recently been adapted to accommodate c- and z•-ions; however, inclusion of these additional spectral features can hinder identification, leading to lower confidence scores and decreased sensitivity. Therefore, it is important to pre-process spectral data prior to submission to a database search to remove those features which cause complications. Here, we demonstrate the effect of removing these features on the number of identifications at a 1% false discovery rate (FDR) using the open mass spectrometry search algorithm (OMSSA). When analyzing two biological replicates of a yeast protein extract in three total analyses, the number of identifications with a ~1% FDR increased from ~4611 to ~5931 upon spectral pre-processing – an increase of ~28.6%. We outline the most effective pre-processing methods, and provide free software containing these algorithms.
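One of the pre-processing steps, removal of the unreacted precursor and its charge-reduced forms, can be sketched as follows. This is a minimal illustration and not the software provided with the paper: it ignores the electron mass, uses a single symmetric m/z window (an assumed value), and does not treat neutral-loss peaks separately. The function names are hypothetical.

```python
def charge_reduced_mzs(precursor_mz, charge):
    """m/z of the precursor and of its charge-reduced forms.
    Electron transfer lowers the charge while the ion mass stays
    essentially unchanged (electron mass ignored)."""
    ion_mass = precursor_mz * charge
    return [ion_mass / z for z in range(charge, 0, -1)]


def strip_precursor_features(peaks, precursor_mz, charge, window=2.0):
    """Drop (m/z, intensity) peaks lying within `window` of the precursor
    or any charge-reduced precursor."""
    targets = charge_reduced_mzs(precursor_mz, charge)
    return [(mz, intensity) for mz, intensity in peaks
            if all(abs(mz - t) > window for t in targets)]
```

For a triply charged precursor at m/z 500, the sketch removes peaks near m/z 500, 750 and 1500 and leaves the remaining fragment ions untouched.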
Analysis of multiple LC-MS based metabolomic studies is carried out to determine overlaps and differences among various experiments. For example, in large metabolic biomarker discovery studies involving hundreds of samples, it may be necessary to conduct multiple experiments, each involving a subset of the samples due to technical limitations. The ions selected from each experiment are analyzed to determine overlapping ions. One of the challenges in comparing the ion lists is the presence of a large number of derivative ions such as isotopes, adducts, and fragments. These derivative ions and the retention time drifts need to be taken into account during comparison.
We implemented an ion annotation-assisted method to determine overlapping ions in the presence of derivative ions. Following this, each ion is represented by the monoisotopic mass of its cluster. This mass is then used to determine overlaps among the ions selected across multiple experiments.
The resulting ion list provides better coverage and more accurate identification of metabolites compared to the traditional method in which overlapping ions are selected on the basis of individual ion mass.
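The final comparison step can be sketched once each ion cluster has been reduced to a monoisotopic mass. The code below assumes that ion annotation (grouping of isotopes, adducts and fragments) has already been done upstream, and that each experiment is given as a list of (monoisotopic mass, retention time in seconds) pairs. The tolerances and names are illustrative assumptions, not values from the study.

```python
def within_ppm(mass_a, mass_b, ppm):
    """True if the two masses agree within a relative tolerance in ppm."""
    return abs(mass_a - mass_b) <= mass_a * ppm * 1e-6


def overlapping_ions(experiment_a, experiment_b, ppm=10.0, rt_tol=30.0):
    """Pair up ions whose monoisotopic masses and retention times agree.
    The retention-time tolerance absorbs drift between experiments."""
    matches = []
    for mass_a, rt_a in experiment_a:
        for mass_b, rt_b in experiment_b:
            if within_ppm(mass_a, mass_b, ppm) and abs(rt_a - rt_b) <= rt_tol:
                matches.append(((mass_a, rt_a), (mass_b, rt_b)))
    return matches
```

Comparing one representative mass per cluster avoids counting an isotope peak in one experiment and an adduct of the same metabolite in another as two unrelated ions.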
Peptides are commonly identified by searching tandem mass spectrometric data against a protein sequence collection; the protein sequences are digested in silico according to the specificity of the enzyme used in the experiment, and the fragments of each peptide are calculated. With this approach, the information in the tandem mass spectra cannot fully be utilized because we can neither predict the probability for observing a peptide, nor accurately calculate the intensities of the fragment ions from the sequence. In contrast, if we search a spectrum library that has been built from experimental data, we can more effectively utilize the information, because the search space is minimized by searching only mass spectra that have previously been observed, and the intensities of the fragment ions in the query spectrum and the library spectrum are similar. Here, we present a method for constructing and using annotated peptide spectrum libraries (ASL).
The ASL were created from a set of consensus spectra associated with peptide sequences from approximately 13,000,000 confidently assigned experimental tandem spectra from the Global Proteome Machine Database, using a four-stage pipeline curation process to improve the reliability. The current ASL collection contains data for six model eukaryotic species: human, mouse, dog, cow, rat, and budding yeast. Average ASL number of spectra per gene ranged from 6.9 (human) to 3.8 (cow). The peptide sequences in these libraries are sequence aligned with the corresponding ENSEMBL, SWISS-PROT, IPI, or SGD accession numbers. A high-speed search engine, X! Hunter, was constructed to use these libraries, which can identify peptides from sets of experimental mass spectra at a rate of 20,000 spectra/second.
The speed and sensitivity of the search engine are compared with standard techniques, and application of ASL to the high-speed screening of experimental data and instrument control is discussed.
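To show why library searching can exploit fragment intensities, here is a minimal sketch of matching a query spectrum against a small library by cosine similarity of binned intensities. It is a generic similarity search and not the X! Hunter scoring scheme. The bin width, the library layout (a dict from peptide name to peak list) and the function names are assumptions.

```python
from math import sqrt


def bin_spectrum(peaks, bin_width=1.0):
    """Sum (m/z, intensity) peaks into integer m/z bins."""
    binned = {}
    for mz, intensity in peaks:
        key = int(mz / bin_width)
        binned[key] = binned.get(key, 0.0) + intensity
    return binned


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse binned spectra."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def best_library_match(query_peaks, library):
    """Return (score, name) of the most similar library spectrum, or None."""
    query = bin_spectrum(query_peaks)
    scored = [(cosine_similarity(query, bin_spectrum(peaks)), name)
              for name, peaks in library.items()]
    return max(scored) if scored else None
```

A practical search would first restrict candidates by precursor mass, which is what keeps the search space small and the search fast.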
The advantages and disadvantages of acquiring tandem mass spectra by collision-induced dissociation (CID) of peptides in linear ion trap – Fourier-transform hybrid instruments are described. These instruments offer the possibility to transfer fragment ions from the linear ion trap to the FT-based analyzer for analysis with both high resolution and high mass accuracy. In addition, performing CID during the transfer of ions from the linear ion trap (LTQ) to the FT analyzer is also possible in instruments containing an additional collision cell (i.e., the “C-trap” in the LTQ-Orbitrap), resulting in tandem mass spectra over the full m/z range and not limited by the ejection q value of the LTQ. Our results show that these scan modes have lower duty cycles than tandem mass spectra acquired in the LTQ with nominal mass resolution, and typically result in fewer peptide identifications during data-dependent analysis of complex samples. However, the higher measured mass accuracy and resolution provides more specificity and hence provides a lower false positive ratio for the same number of true positives during database search of peptide tandem mass spectra. In addition, the search for modified and unexpected peptides is greatly facilitated with this data acquisition mode. It is therefore concluded that acquisition of tandem mass spectral data with high measured mass accuracy and resolution is a competitive alternative to “classical” data acquisition strategies, especially in situations of complex searches from large databases, searches for modified peptides, or for peptides resulting from unspecific cleavages.
Mass spectrometry is an essential technique in proteomics both to identify the proteins of a biological sample and to compare proteomic profiles of different samples. In both cases, the main phase of the data analysis is the procedure to extract the significant features from a mass spectrum. Its final output is the so-called peak list which contains the mass, the charge and the intensity of every detected biomolecule. The main steps of the peak list extraction procedure are usually preprocessing, peak detection, peak selection, charge determination and monoisotoping operation.
This paper describes an original algorithm for peak list extraction from low- and high-resolution mass spectra. It has been developed principally to improve the precision of peak extraction in comparison to other reference algorithms. It contains many innovative features, including a sophisticated method for managing overlapping isotopic distributions.
The performance of the basic version of the algorithm and of its optional functionalities has been evaluated in this paper on SELDI-TOF, MALDI-TOF and ESI-FTICR ECD mass spectra. Executable files of MassSpec, a MATLAB implementation of the peak list extraction procedure for Windows and Linux systems, can be downloaded free of charge by nonprofit institutions from the following web site: http://aimed11.unipv.it/MassSpec
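The charge determination and monoisotoping steps named in the peak list extraction procedure can be illustrated with a minimal Python sketch. It is not the MassSpec algorithm: it assumes a single, non-overlapping isotopic cluster, protonated ions, and a peak spacing set by the 13C/12C mass difference. The function names, the tolerance, and the maximum charge are hypothetical choices made for illustration.

```python
# Approximate mass difference between 13C and 12C, in daltons.
ISOTOPE_SPACING = 1.00335
# Approximate mass of a proton, in daltons.
PROTON_MASS = 1.007276


def estimate_charge(mz_peaks, max_charge=6, tol=0.01):
    """Guess the charge state from the spacing of adjacent isotope peaks.

    mz_peaks: sorted m/z values assumed to belong to one isotopic cluster.
    Returns the charge z whose expected spacing (ISOTOPE_SPACING / z) is
    closest to the middle observed spacing, or None if no charge fits
    within tol.
    """
    if len(mz_peaks) < 2:
        return None
    gaps = sorted(b - a for a, b in zip(mz_peaks, mz_peaks[1:]))
    middle_gap = gaps[len(gaps) // 2]
    best = min(
        range(1, max_charge + 1),
        key=lambda z: abs(middle_gap - ISOTOPE_SPACING / z),
    )
    if abs(middle_gap - ISOTOPE_SPACING / best) > tol:
        return None
    return best


def neutral_monoisotopic_mass(monoisotopic_mz, charge):
    """Convert the m/z of the monoisotopic peak to a neutral mass,
    assuming the ion carries `charge` protons."""
    return (monoisotopic_mz - PROTON_MASS) * charge


cluster = [500.0, 500.5017, 501.0034]
z = estimate_charge(cluster)
mass = neutral_monoisotopic_mass(cluster[0], z)
```

Overlapping isotopic distributions, which the abstract singles out as a difficult case, would break the single-cluster assumption made here.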
Tandem mass spectrometry (MS/MS) is a powerful tool for protein identification. Although great efforts have been made in scoring the correlation between tandem mass spectra and an amino acid sequence database, improvements could be made in three aspects: characterization of peaks in spectra, adoption of effective scoring functions, and assessment of the reliability of matching between peptides and spectra.
A novel scoring function is presented, along with criteria to estimate the performance confidence of the function. By learning the types of product ions and the probability of generating them, a hypothetical spectrum was generated for each candidate peptide. Relative entropy was then introduced to measure the similarity between the hypothetical and the observed spectra. Based on extreme value distribution (EVD) theory, a threshold was chosen to distinguish a true peptide assignment from a random one. Tests on a public MS/MS dataset demonstrated that this method performs better than the well-known SEQUEST.
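As an illustration of how relative entropy can compare a hypothetical spectrum with an observed one, the following Python sketch bins both spectra along the m/z axis, normalizes the binned intensities to probability distributions, and computes the Kullback-Leibler divergence. The binning scheme, the pseudocount, the direction of the divergence, and all function names are assumptions made here; the abstract does not specify how the authors construct the distributions.

```python
import math


def binned_distribution(peaks, mz_min, mz_max, bin_width=1.0, pseudocount=1e-6):
    """Turn (m/z, intensity) pairs into a normalized histogram.

    A small pseudocount in every bin keeps the logarithm finite when a
    bin is empty in one spectrum but not in the other.
    """
    n_bins = int(math.ceil((mz_max - mz_min) / bin_width))
    bins = [pseudocount] * n_bins
    for mz, intensity in peaks:
        if mz_min <= mz < mz_max:
            index = min(int((mz - mz_min) / bin_width), n_bins - 1)
            bins[index] += intensity
    total = sum(bins)
    return [b / total for b in bins]


def relative_entropy(p, q):
    """Kullback-Leibler divergence D(p || q) in nats; 0 means identical."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


observed = [(100.2, 10.0), (200.7, 30.0)]
good_candidate = [(100.2, 10.0), (200.7, 30.0)]
poor_candidate = [(100.2, 10.0), (300.1, 30.0)]

p_obs = binned_distribution(observed, 50.0, 400.0)
score_good = relative_entropy(p_obs, binned_distribution(good_candidate, 50.0, 400.0))
score_poor = relative_entropy(p_obs, binned_distribution(poor_candidate, 50.0, 400.0))
```

Under these assumptions a lower value indicates a closer match, so `score_good` is smaller than `score_poor`. Choosing a significance threshold from an extreme value distribution fitted to the scores of random candidates is a separate step not shown here.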
A reliable identification of proteins from the spectra promises a more efficient application of tandem mass spectrometry to proteomes with high complexity.
False positives that arise when MS/MS data are used to search protein sequence databases remain a concern in proteomics research. Here we present five types of false positives identified when aligning sequences to MS/MS spectra with the Mascot database search software. False positives arise because of 1) enzymatic digestion at abnormal sites; 2) misinterpretation of charge states; 3) misinterpretation of protein modifications; 4) incorrect assignment of the protein modification site; and 5) incorrect use of isotopic peaks. We present examples, clearly identified as false positives by manual inspection, that nevertheless were assigned high scores by the Mascot sequence alignment algorithm. In some examples, the sequence assigned to the MS/MS spectrum explains more than 80% of the fragment ions present. Because of the high sequence similarity between the false positives and their corresponding true hits, the false positive rate cannot be evaluated by the common method of using a reversed or scrambled sequence database. A common feature of the false positives is the presence of unmatched peaks in the MS/MS spectra. Our studies highlight the importance of using unmatched peaks to remove false positives and offer direction to aid development of better sequence alignment algorithms for peptide and PTM identification.
protein identification; manual verification; automated database search
The promise of mass spectrometry as a tool for probing signal-transduction is predicated on reliable identification of post-translational modifications. Phosphorylations are key mediators of cellular signaling, yet are hard to detect, partly because of unusual fragmentation patterns of phosphopeptides. In addition to being accurate, MS/MS identification software must be robust and efficient to deal with increasingly large spectral data sets. Here, we present a new scoring function for the Inspect software for phosphorylated peptide tandem mass spectra for ion-trap instruments, without the need for manual validation. The scoring function was modeled by learning fragmentation patterns from 7677 validated phosphopeptide spectra. We compare our algorithm against SEQUEST and X!Tandem on testing and training data sets. At a 1% false positive rate, Inspect identified the greatest total number of phosphorylated spectra, 13% more than SEQUEST and 39% more than X!Tandem. Spectra identified by Inspect tended to score better in several spectral quality measures. Furthermore, Inspect runs much faster than either SEQUEST or X!Tandem, making desktop phosphoproteomics feasible. Finally, we used our new models to reanalyze a corpus of 423 000 LTQ spectra acquired for a phosphoproteome analysis of Saccharomyces cerevisiae DNA damage and repair pathways and discovered 43% more phosphopeptides than the previous study.
Phosphoproteomics; Scoring; High-throughput proteomics; Post-translational modifications
Mass spectrometry-based metabolomics represents a new area for bioinformatics technology development. While currently available computational tools such as XCMS statistically assess and rank LC–MS features, they do not provide information about the structural identity of those features. XCMS2 is an open source software package which has been developed to automatically search tandem mass spectrometry (MS/MS) data against high-quality experimental MS/MS data from known metabolites contained in a reference library (METLIN). Scoring of hits is based on a “shared peak count” method that identifies masses of fragment ions shared between the analytical and reference MS/MS spectra. Another functional component of XCMS2 is the capability of providing structural information for unknown metabolites that are not in the METLIN database. This “similarity search” algorithm has been developed to detect possible structural motifs in the unknown metabolite that may produce characteristic fragment ions and neutral losses in common with related reference compounds contained in METLIN, even if the precursor masses are not the same.
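The “shared peak count” idea can be sketched in a few lines of Python. This is an illustration of the general technique under stated assumptions, not the XCMS2 implementation: the absolute m/z tolerance, the one-to-one matching rule, the function names, and the toy library entries are all hypothetical.

```python
def shared_peak_count(query_mz, reference_mz, tol=0.01):
    """Count query fragment m/z values that match a reference fragment
    within an absolute tolerance. Each reference peak is used at most once,
    and each query peak takes the closest unused reference peak."""
    reference = sorted(reference_mz)
    used = [False] * len(reference)
    count = 0
    for mz in sorted(query_mz):
        best = None
        for i, ref in enumerate(reference):
            if used[i] or abs(ref - mz) > tol:
                continue
            if best is None or abs(ref - mz) < abs(reference[best] - mz):
                best = i
        if best is not None:
            used[best] = True
            count += 1
    return count


def rank_library(query_mz, library, tol=0.01):
    """Score every library entry against the query and sort best-first.

    library: dict mapping a compound name to its list of fragment m/z values.
    """
    scores = [
        (name, shared_peak_count(query_mz, fragments, tol))
        for name, fragments in library.items()
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Toy data: names and fragment masses are invented for illustration only.
toy_library = {
    "compound_A": [85.03, 127.04, 145.05, 163.06],
    "compound_B": [85.03, 99.01],
}
ranking = rank_library([85.03, 127.04, 145.05], toy_library)
```

A similarity search of the kind described above would additionally compare neutral losses (precursor m/z minus fragment m/z), which lets spectra with different precursor masses still share features; that extension is not shown here.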