Controlled vocabularies (CVs), i.e. collections of predefined terms that describe a modeling domain and are used for the semantic annotation of data, and ontologies are used in structured data formats and databases to avoid inconsistencies in annotation, to provide unique (and preferably short) accession numbers and to give researchers and computer algorithms the possibility of more expressive semantic annotation of data. The Human Proteome Organization (HUPO)–Proteomics Standards Initiative (PSI) makes extensive use of ontologies/CVs in its data formats. The PSI-Mass Spectrometry (MS) CV contains all the terms used in the PSI MS–related data standards. The CV has a logical hierarchical structure to ensure ease of maintenance and to support the development of software that makes use of complex semantics. The CV contains terms required for a complete description of an MS analysis pipeline used in proteomics, including sample labeling, digestion enzymes, instrumentation parts and parameters, software used for identification and quantification of peptides/proteins and the parameters and scores used to determine their significance. Owing to the range of topics covered by the CV, collaborative development across several PSI working groups, including proteomics research groups, instrument manufacturers and software vendors, was necessary. In this article, we describe the overall structure of the CV, the process by which it has been developed and is maintained and the dependencies on other ontologies.
Database URL: http://psidev.cvs.sourceforge.net/viewvc/psidev/psi/psi-ms/mzML/controlledVocabulary/psi-ms.obo
Spectral libraries have emerged as a viable alternative to protein sequence databases for peptide identification. These libraries contain previously detected peptide sequences and their corresponding tandem mass spectra (MS/MS). Search engines can then identify peptides by comparing experimental MS/MS scans to those in the library. Many of these algorithms employ the dot product score for measuring the quality of a spectrum-spectrum match (SSM). This scoring system does not offer a clear statistical interpretation and ignores fragment ion m/z discrepancies in the scoring. We developed a new spectral library search engine, Pepitome, which employs statistical systems for scoring SSMs. Pepitome outperformed the leading library search tool, SpectraST, when analyzing data sets acquired on three different mass spectrometry platforms. We characterized the reliability of spectral library searches by confirming shotgun proteomics identifications through RNA-Seq data. Applying spectral library and database searches on the same sample revealed their complementary nature. Pepitome identifications enabled the automation of quality analysis and quality control (QA/QC) for shotgun proteomics data acquisition pipelines.
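The dot product score discussed above can be illustrated with a minimal sketch. This is a generic normalized dot product over binned spectra, not the scoring used by Pepitome or SpectraST; the function names and the 1.0 m/z bin width are assumptions made for illustration.

```python
import math
from collections import defaultdict


def bin_spectrum(peaks, bin_width=1.0):
    """Sum peak intensities into fixed-width m/z bins.

    peaks: iterable of (mz, intensity) pairs.
    bin_width: assumed bin size in m/z units (hypothetical default).
    """
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz / bin_width)] += intensity
    return binned


def normalized_dot_product(query_peaks, library_peaks, bin_width=1.0):
    """Cosine-style similarity between two binned spectra, in [0, 1]."""
    query = bin_spectrum(query_peaks, bin_width)
    library = bin_spectrum(library_peaks, bin_width)
    # Only bins present in both spectra contribute to the dot product.
    dot = sum(query[b] * library[b] for b in query if b in library)
    norm = (math.sqrt(sum(v * v for v in query.values()))
            * math.sqrt(sum(v * v for v in library.values())))
    return dot / norm if norm > 0 else 0.0
```

In this sketch, peaks at m/z 100.1 and 100.9 fall into the same bin and count as a perfect match, which illustrates how a binned dot product disregards fragment ion m/z discrepancies.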
Summary: The large amount of data produced by proteomics experiments requires effective bioinformatics tools for the integration of data management and data analysis. Here we introduce a suite of tools developed at Vanderbilt University to support production proteomics. We present the Backup Utility Service tool for automated instrument file backup and the ScanSifter tool for data conversion. We also describe a queuing system to coordinate identification pipelines and the File Collector tool for batch copying analytical results. These tools are individually useful but collectively reinforce each other. They are particularly valuable for proteomics core facilities or research institutions that need to manage multiple mass spectrometers. With minor changes, they could support other types of biomolecular resource facilities.
Availability and Implementation: Source code and executable versions are available under Apache 2.0 License at http://www.vicc.org/jimayersinstitute/data/
In shotgun proteomics, protein identification by tandem mass spectrometry relies on bioinformatics tools. Despite recent improvements in identification algorithms, a significant number of high quality spectra remain unidentified for various reasons. Here we present ScanRanker, an open-source tool that evaluates the quality of tandem mass spectra via sequence tagging with reliable performance in data from different instruments. The superior performance of ScanRanker enables it not only to find unassigned high quality spectra that evade identification through database search, but also to select spectra for de novo sequencing and cross-linking analysis. In addition, we demonstrate that the distribution of ScanRanker scores predicts the richness of identifiable spectra among multiple LC-MS/MS runs in an experiment, and ScanRanker scores assist the process of peptide assignment validation to increase confident spectrum identifications. The source code and executable versions of ScanRanker are available from http://fenchurch.mc.vanderbilt.edu.
Keywords: spectral quality; sequence tagging; bioinformatics; tandem mass spectrometry; cross-linking
We report the release of mzIdentML, an exchange standard for peptide and protein identification data, designed by the Proteomics Standards Initiative. The format was developed by the Proteomics Standards Initiative in collaboration with instrument and software vendors, and the developers of the major open-source projects in proteomics. Software implementations have been developed to enable conversion from most popular proprietary and open-source formats, and mzIdentML will soon be supported by the major public repositories. These developments enable proteomics scientists to start working with the standard for exchanging and publishing data sets in support of publications and they provide a stable platform for bioinformatics groups and commercial software vendors to work with a single file format for identification data.
Toxicoproteomic samples are rich in posttranslational modifications (PTMs) of proteins. Identifying these modifications via standard database searching can incur significant performance penalties. Here we describe the latest developments in TagRecon, an algorithm that leverages inferred sequence tags to identify modified peptides in toxicoproteomic data sets. TagRecon identifies known modifications more effectively than the MyriMatch database search engine. TagRecon outperformed state-of-the-art software in recognizing unanticipated modifications from LTQ, Orbitrap, and QTOF data sets. We developed user-friendly software for detecting persistent mass shifts from samples. We follow a three-step strategy for detecting unanticipated PTMs in samples. First, we identify the proteins present in the sample with a standard database search. Next, identified proteins are interrogated for unexpected PTMs with a sequence tag-based search. Finally, additional evidence is gathered for the detected mass shifts with a refinement search. Application of this technology on toxicoproteomic data sets revealed unintended cross-reactions between proteins and sample processing reagents. Twenty-five proteins in rat liver showed signs of oxidative stress when exposed to potentially toxic drugs. These results demonstrate the value of mining toxicoproteomic data sets for modifications.
Targeted proteomics via selected reaction monitoring is a powerful mass spectrometric technique affording higher dynamic range, increased specificity and lower limits of detection than other shotgun mass spectrometry methods when applied to proteome analyses. However, it involves selective measurement of predetermined analytes, which requires more preparation in the form of selecting appropriate signatures for the proteins and peptides that are to be targeted. There is a growing number of software programs and resources for selecting optimal transitions and the instrument settings used for the detection and quantification of the targeted peptides, but the exchange of this information is hindered by a lack of a standard format. We have developed a new standardized format, called TraML, for encoding transition lists and associated metadata. In addition to introducing the TraML format, we demonstrate several implementations across the community, and provide semantic validators, extensive documentation, and multiple example instances to demonstrate correctly written documents. Widespread use of TraML will facilitate the exchange of transitions, reduce time spent handling incompatible list formats, increase the reusability of previously optimized transitions, and thus accelerate the widespread adoption of targeted proteomics via selected reaction monitoring.
Shotgun proteomics produces collections of tandem mass spectra that contain all the data needed to identify mutated peptides from clinical samples. Identifying these sequence variations, however, has not been feasible with conventional database search strategies, which require exact matches between observed and expected sequences. Searching for mutations as mass shifts on specified residues through database search can incur significant performance penalties and generate substantial false positive rates. Here we describe TagRecon, an algorithm that leverages inferred sequence tags to identify unanticipated mutations in clinical proteomic data sets. TagRecon identifies unmodified peptides as sensitively as the related MyriMatch database search engine. In both LTQ and Orbitrap data sets, TagRecon outperformed state-of-the-art software in recognizing sequence mismatches from data sets with known variants. We developed guidelines for filtering putative mutations from clinical samples, and we applied them in an analysis of cancer cell lines and an examination of colon tissue. Mutations were found in up to 6% of identified peptides, and only a small fraction corresponded to dbSNP entries. The RKO cell line, which is DNA mismatch repair deficient, yielded more mutant peptides than the mismatch repair proficient SW480 line. Analysis of colon cancer tumor and adjacent tissue revealed hydroxyproline modifications associated with extracellular matrix degradation. These results demonstrate the value of using sequence tagging algorithms to fully interrogate clinical proteomic data sets.
Keywords: mutation; bioinformatics; hydroxyproline; sequence tagging
Summary: Skyline is a Windows client application for targeted proteomics method creation and quantitative data analysis. It is open source and freely available for academic and commercial use. The Skyline user interface simplifies the development of mass spectrometer methods and the analysis of data from targeted proteomics experiments performed using selected reaction monitoring (SRM). Skyline supports using and creating MS/MS spectral libraries from a wide variety of sources to choose SRM filters and verify results based on previously observed ion trap data. Skyline exports transition lists to and imports the native output files from Agilent, Applied Biosystems, Thermo Fisher Scientific and Waters triple quadrupole instruments, seamlessly connecting mass spectrometer output back to the experimental design document. The fast and compact Skyline file format is easily shared, even for experiments requiring many sample injections. A rich array of graphs displays results and provides powerful tools for inspecting data integrity as data are acquired, helping instrument operators to identify problems early. The Skyline dynamic report designer exports tabular data from the Skyline document model for in-depth analysis with common statistical tools.
Availability: Single-click, self-updating web installation is available at http://proteome.gs.washington.edu/software/skyline. This web site also provides access to instructional videos, a support board, an issues list and a link to the source code project.
Supplementary information: Supplementary data are available at Bioinformatics online.
Mass spectrometry is a fundamental tool for discovery and analysis in the life sciences. With the rapid advances in mass spectrometry technology and methods, it has become imperative to provide a standard output format for mass spectrometry data that will facilitate data sharing and analysis. Initially, the efforts to develop a standard format for mass spectrometry data resulted in multiple formats, each designed with a different underlying philosophy. To resolve the issues associated with having multiple formats, vendors, researchers, and software developers convened under the banner of the HUPO PSI to develop a single standard. The new data format incorporated many of the desirable technical attributes from the previous data formats, while adding a number of improvements, including features such as a controlled vocabulary with validation tools to ensure consistent usage of the format, improved support for selected reaction monitoring data, and immediately available implementations to facilitate rapid adoption by the community. The resulting standard data format, mzML, is a well-tested open-source format for mass spectrometer output files that can be readily utilized by the community and easily adapted for incremental advances in mass spectrometry technology.
Tandem mass spectrometry-based shotgun proteomics has become a widespread technology for analyzing complex protein mixtures. A number of database searching algorithms have been developed to assign peptide sequences to tandem mass spectra. Assembling the peptide identifications to proteins, however, is a challenging issue because many peptides are shared among multiple proteins. IDPicker is an open-source protein assembly tool that derives a minimum protein list from peptide identifications filtered to a specified False Discovery Rate. Here, we update IDPicker to increase confident peptide identifications by combining multiple scores produced by database search tools. By segregating peptide identifications for thresholding using both the precursor charge state and the number of tryptic termini, IDPicker retrieves more peptides for protein assembly. The new version is more robust against false positive proteins, especially in searches using multispecies databases, by requiring additional novel peptides in the parsimony process. IDPicker has been designed for incorporation in many identification workflows by the addition of a graphical user interface and the ability to read identifications from the pepXML format. These advances position IDPicker for high peptide discrimination and reliable protein assembly in large-scale proteomics studies. The source code and binaries for the latest version of IDPicker are available from http://fenchurch.mc.vanderbilt.edu/.
Keywords: bioinformatics; parsimony; protein assembly; protein inference; false discovery rate
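Filtering peptide identifications to a specified False Discovery Rate can be sketched as follows. This is a generic target-decoy illustration, not IDPicker's implementation: the decoys/targets estimator, the function name `filter_to_fdr`, and the `(score, is_decoy)` input layout are all assumptions, and the sketch does not segregate identifications by charge state or tryptic termini.

```python
def filter_to_fdr(psms, max_fdr=0.05):
    """Keep the largest top-scoring set of matches whose estimated FDR
    does not exceed max_fdr.

    psms: list of (score, is_decoy) pairs; a higher score is assumed better.
    The FDR at each rank is estimated as decoys / targets (an assumed,
    commonly used estimator; other variants exist).
    Returns the passing target matches, best first.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = decoys = 0
    cutoff = 0
    for rank, (_score, is_decoy) in enumerate(ranked, start=1):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdr = decoys / targets if targets else 1.0
        if fdr <= max_fdr:
            cutoff = rank  # deepest rank that still satisfies the threshold
    return [p for p in ranked[:cutoff] if not p[1]]
```

A stricter threshold moves the cutoff toward the top of the ranked list, trading sensitivity for confidence in the peptides passed on to protein assembly.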
Assembling peptides identified from LC–MS/MS spectra into a list of proteins is a critical step in analyzing shotgun proteomics data. As one peptide sequence can be mapped to multiple proteins in a database, naïve protein assembly can substantially overstate the number of proteins found in samples. We model the peptide–protein relationships in a bipartite graph and use efficient graph algorithms to identify protein clusters with shared peptides and to derive the minimal list of proteins. We test the effects of this parsimony analysis approach using MS/MS data sets generated from a defined human protein mixture, a yeast whole cell extract, and a human serum proteome after MARS column depletion. The results demonstrate that the bipartite parsimony technique not only simplifies protein lists but also improves the accuracy of protein identification. We use bipartite graphs for the visualization of the protein assembly results to render the parsimony analysis process transparent to users. Our approach also groups functionally related proteins together and improves the comprehensibility of the results. We have implemented the tool in the IDPicker package. The source code and binaries for this protein assembly pipeline are available under Mozilla Public License at the following URL: http://www.mc.vanderbilt.edu/msrc/bioinformatics/.
Keywords: parsimony analysis; bipartite graph; shotgun proteomics; LC-MS/MS; protein assembly
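The idea of deriving a minimal protein list from a peptide-protein bipartite graph can be sketched with a greedy set-cover heuristic. This is an illustrative approximation only; the abstract does not specify the graph algorithms used in IDPicker, and the function name and input layout below are hypothetical.

```python
def minimal_protein_list(protein_to_peptides):
    """Greedily pick proteins until every identified peptide is explained.

    protein_to_peptides: dict mapping a protein name to the set of peptide
    sequences matched to it (one side of the bipartite graph per key,
    edges given by set membership).
    Returns the chosen proteins in selection order.
    """
    uncovered = set().union(*protein_to_peptides.values())
    chosen = []
    while uncovered:
        # Pick the protein explaining the most still-unexplained peptides;
        # iterating over sorted names makes tie-breaking deterministic.
        best = max(sorted(protein_to_peptides),
                   key=lambda p: len(protein_to_peptides[p] & uncovered))
        chosen.append(best)
        uncovered -= protein_to_peptides[best]
    return chosen
```

For example, if one protein accounts for peptides a, b and c while a second protein matches only b and c, the second adds no new evidence and is left out of the list. A greedy cover is not guaranteed to be the true minimum, so it should be read as a sketch of the parsimony principle rather than an exact method.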
In shotgun proteomics, tandem mass spectra of peptides are typically identified through database search algorithms such as Sequest. We have developed DirecTag, an open-source algorithm to infer partial sequence tags directly from observed fragment ions. This algorithm is unique in its implementation of three separate scoring systems to evaluate each tag on the basis of peak intensity, m/z fidelity, and complementarity. In data sets from several types of mass spectrometers, DirecTag reproducibly exceeded the accuracy and speed of InsPecT and GutenTag, two previously published algorithms for this purpose. The source code and binaries for DirecTag are available from http://fenchurch.mc.vanderbilt.edu.
Keywords: sequence tagging; bioinformatics; de novo; multi-platform; peptide identification
Shotgun proteomics experiments are dependent upon database search engines to identify peptides from tandem mass spectra. Many of these algorithms score potential identifications by evaluating the number of fragment ions matched between each peptide sequence and an observed spectrum. These systems, however, generally do not distinguish between matching an intense peak and matching a minor peak. We have developed a statistical model to score peptide matches that is based upon the multivariate hypergeometric distribution. This scorer, part of the “MyriMatch” database search engine, places greater emphasis on matching intense peaks. The probability that the best match for each spectrum has occurred by random chance can be employed to separate correct matches from random ones. We evaluated this software on data sets from three different laboratories employing three different ion trap instruments. Employing a novel system for testing discrimination, we demonstrate that stratifying peaks into multiple intensity classes improves the discrimination of scoring. We compare MyriMatch results to those of Sequest and X!Tandem, revealing that it is capable of higher discrimination than either of these algorithms. When minimal peak filtering is employed, performance plummets for a scoring model that does not stratify matched peaks by intensity. On the other hand, we find that MyriMatch discrimination improves as more peaks are retained in each spectrum. MyriMatch also scales well to tandem mass spectra from high-resolution mass analyzers. These findings may indicate limitations for existing database search scorers that count matched peaks without differentiating them by intensity. This software and source code are available under Mozilla Public License at this URL: http://www.mc.vanderbilt.edu/msrc/bioinformatics/.
Keywords: Proteomics; Identification; Statistical Distribution; Reversed Database; Peak Filtering
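The multivariate hypergeometric distribution underlying this kind of scoring can be written out in a few lines. The sketch below computes only the textbook probability mass function; how MyriMatch assigns peaks to intensity classes and turns the probability into a final score is not given in the abstract, so the mapping described in the comments is an assumption.

```python
import math


def mvh_probability(class_sizes, drawn_counts):
    """Multivariate hypergeometric probability mass.

    class_sizes: number of items in each class of the population
        (assumed mapping: m/z bins holding intense, medium or weak peaks,
        plus a class of empty bins).
    drawn_counts: how many draws landed in each class
        (assumed mapping: predicted fragment ions matching each class).
    Returns prod_i C(K_i, k_i) / C(N, n), with N = sum(K_i), n = sum(k_i).
    """
    if len(class_sizes) != len(drawn_counts):
        raise ValueError("class_sizes and drawn_counts must align")
    numerator = 1
    for size, drawn in zip(class_sizes, drawn_counts):
        numerator *= math.comb(size, drawn)
    return numerator / math.comb(sum(class_sizes), sum(drawn_counts))
```

Because intense peaks form a small class, landing several matches in that class by chance is less probable than landing the same number in a larger class of weak peaks, which is one way a model of this form can weight intense matches more heavily. `math.comb` requires Python 3.8 or later.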