The Human Proteome Project was launched in September 2010 with the goal of characterizing at least one protein product from each protein-coding gene. Here we assess how much of the proteome has been detected to date via tandem mass spectrometry by analyzing PeptideAtlas, a compendium of human-derived LC-MS/MS proteomics data from many laboratories around the world. All datasets are processed with a consistent set of parameters using the Trans-Proteomic Pipeline and subjected to a 1% protein FDR filter before inclusion in PeptideAtlas. Therefore, PeptideAtlas contains only high-confidence protein identifications. To increase proteome coverage, we explored new comprehensive public data sources for data likely to add new proteins to the Human PeptideAtlas. We then folded these data into a Human PeptideAtlas 2012 build and mapped it to Swiss-Prot, a protein sequence database curated to contain one entry per human protein-coding gene. We find that this latest PeptideAtlas build includes at least one peptide for each of ~12,500 Swiss-Prot entries, leaving ~7500 gene products yet to be confidently cataloged. We characterize these “PA-unseen” proteins in terms of tissue localization, transcript abundance, and Gene Ontology enrichment, and propose reasons for their absence from PeptideAtlas and strategies for detecting them in the future.
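The decoy-based FDR filtering applied before inclusion in PeptideAtlas can be sketched as follows. This is a minimal illustration with hypothetical scores and a simple threshold search; the Trans-Proteomic Pipeline's actual protein-level error model is more sophisticated.

```python
def fdr_filter(targets, decoys, max_fdr=0.01):
    """Return the target scores passing a decoy-estimated FDR threshold.

    targets, decoys: lists of (higher-is-better) protein scores.
    At threshold t, FDR is estimated as (#decoys >= t) / (#targets >= t).
    The lowest threshold still satisfying max_fdr is kept.
    """
    accepted = []
    for t in sorted(targets, reverse=True):
        n_targets = sum(1 for s in targets if s >= t)
        n_decoys = sum(1 for s in decoys if s >= t)
        if n_decoys / n_targets <= max_fdr:
            accepted = [s for s in targets if s >= t]
    return accepted
```

With four well-separated target scores and two low decoy scores, the filter keeps the four targets and rejects the score falling in the decoy range.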
Human Proteome Project; PeptideAtlas; LC-MS/MS; database; protein inference
The Human Proteome Organisation Proteomics Standards Initiative (HUPO-PSI) was established in 2002 with the aim of defining community standards for data representation in proteomics and facilitating data comparison, exchange and verification. Over the last 10 years, significant advances have been made, with common data standards now published and implemented in the fields of both mass spectrometry and molecular interactions. The 2012 meeting further advanced this work, with the mass spectrometry groups finalising approaches to capturing the output from recent developments in the field, such as quantitative proteomics and SRM. The molecular interaction group focused on improving the integration of data from multiple resources. Both groups united with a guest work track, organized by the HUPO Technology/Standards Committee, to formulate proposals for data submissions from the HUPO Human Proteome Project and to start an initiative to collect standard experimental protocols.
This paper focuses on the use of controlled vocabularies (CVs) and ontologies especially in the area of proteomics, primarily related to the work of the Proteomics Standards Initiative (PSI). It describes the relevant proteomics standard formats and the ontologies used within them. Software and tools for working with these ontology files are also discussed. The article also examines the “mapping files” used to ensure that correct controlled vocabulary terms are placed within PSI standards and that the MIAPE (Minimum Information about a Proteomics Experiment) requirements are fulfilled. This article is part of a Special Issue entitled: Computational Proteomics in the Post-Identification Era. Guest Editors: Martin Eisenacher and Christian Stephan.
► The semantic annotation using ontologies is a prerequisite for the semantic web. ► The HUPO-PSI defined a set of XML-based standard formats for proteomics. ► These standard formats allow the referencing of CV terms defined in obo files. ► The CV terms can be used to enforce MIAPE compliance of the data files. ► The mass spectrometry CV is constantly maintained in a community process.
ANDI-MS, Analytical Data Interchange format for Mass Spectrometry; AniML, Analytical Information Markup Language; API, Application Programming Interface; ASCII, American Standard Code for Information Interchange; ASTM, American Society for Testing and Materials; BTO, BRENDA (BRaunschweig ENzyme DAtabase) Tissue Ontology; ChEBI, Chemical Entities of Biological Interest; CV, Controlled Vocabulary; DL, Description Logic; EBI, European Bioinformatics Institute; HDF5, Hierarchical Data Format, version 5; HUPO-PSI, Human Proteome Organisation-Proteomics Standards Initiative; ICD, International Classification of Diseases; IUPAC, International Union of Pure and Applied Chemistry; JCAMP-DX, Joint Committee on Atomic and Molecular Physical Data-Data eXchange format; MALDI, Matrix Assisted Laser Desorption Ionization; MeSH, Medical Subject Headings; MI, Molecular Interaction; MIBBI, Minimal Information for Biological and Biomedical Investigations; MITAB, Molecular Interactions TABular format; MIAPE, Minimum Information About a Proteomics Experiment; MS, Mass Spectrometry; NCBI, National Center for Biotechnology Information; NCBO, National Center for Biomedical Ontology; netCDF, Network Common Data Format; OBI, Ontology for Biomedical Investigations; OBO, Open Biological and Biomedical Ontologies; OLS, Ontology Lookup Service; OWL, Web Ontology Language; PAR, Protein Affinity Reagents; PATO, Phenotype Attribute Trait Ontology; PRIDE, PRoteomics IDEntifications database; RDF(S), Resource Description Framework (Schema); SRM, Selected Reaction Monitoring; TPP, Trans-Proteomic Pipeline; URI, Uniform Resource Identifier; XSLT, eXtensible Stylesheet Language Transformation; YAFMS, Yet Another Format for Mass Spectrometry; Proteomics data standards; Controlled vocabularies; Ontologies in proteomics; Ontology formats; Ontology editors and software; Ontology maintenance
Public repositories for proteomics data have accelerated proteomics research by enabling more efficient cross-analyses of datasets, supporting the creation of protein and peptide compendia of experimental results, supporting the development and testing of new software tools, and facilitating the manuscript review process. The repositories available to date have been designed to accommodate either shotgun experiments or generic proteomic data files. Here, we describe a new kind of proteomic data repository for the collection and representation of data from selected reaction monitoring (SRM) measurements. The PeptideAtlas SRM Experiment Library (PASSEL) allows researchers to easily submit proteomic data sets generated by SRM. The raw data are automatically processed in a uniform manner and the results are stored in a database, where they may be downloaded or browsed via a web interface that includes a chromatogram viewer. PASSEL enables cross-analysis of SRM data, supports optimization of SRM data collection, and facilitates the review process of SRM data. Further, PASSEL will help in the assessment of proteotypic peptide performance in a wide array of samples containing the same peptide, as well as across multiple experimental protocols.
data repository; MRM; software; SRM; targeted proteomics
The application of mass spectrometry (MS) to the analysis of proteomes has enabled the high-throughput identification and abundance measurement of hundreds to thousands of proteins per experiment. However, the formidable informatics challenge associated with analyzing MS data has required a wide variety of data file formats to encode the complex data types associated with MS workflows. These formats encompass the encoding of input instruction for instruments, output products of the instruments, and several levels of information and results used by and produced by the informatics analysis tools. A brief overview of the most common file formats in use today is presented here, along with a discussion of related topics.
The rigorous testing of hypotheses on suitable sample cohorts is a major limitation in translational research. This is particularly the case for the validation of protein biomarkers where the lack of accurate, reproducible and sensitive assays for most proteins has precluded the systematic assessment of hundreds of potential marker proteins described in the literature.
Here, we describe a high throughput method for the development and refinement of selected reaction monitoring (SRM) assays for human proteins. The method was applied to generate such assays for more than 1000 cancer-associated proteins, which are functionally related to candidate cancer driver mutations. We used the assays to determine the detectability of the target proteins in two clinically relevant samples, plasma and urine. 182 proteins were detected in depleted plasma, spanning five orders of magnitude in abundance and reaching below a concentration of 10 ng/mL. The narrower concentration range of proteins in urine allowed the detection of 408 proteins. Moreover, we demonstrate that these SRM assays allow the reproducible quantification of 34 biomarker candidates across 84 patient plasma samples. Through public access to the entire assay library, which will also be expandable in the future, researchers will be able to target their cancer-associated proteins of interest in any sample type using the detectability information in plasma and urine as a guide. The generated reference map of SRM assays for cancer-associated proteins is a valuable resource for accelerating and planning biomarker verification studies.
Access to public data sets is important to the scientific community as a resource to develop new experiments or validate new data. Projects such as the PeptideAtlas, Ensembl and The Cancer Genome Atlas (TCGA) offer both access to public data and a repository to share their own data. Access to these data sets is often provided through a web page form and a web service API. Access technologies based on web protocols (e.g. HTTP) have been in use for over a decade and are widely adopted across the industry for a variety of functions (e.g. search, commercial transactions, and social media). Each architecture adapts these technologies to provide users with tools to access and share data. Both commonly used web service technologies (e.g. REST and SOAP) and custom-built solutions over HTTP are utilized in providing access to research data. Providing multiple access points ensures that the community can access the data in the simplest and most effective manner for their particular needs. This article examines three common access mechanisms for web accessible data: BioMart, caBIG, and Google Data Sources. These are illustrated by implementing each over the PeptideAtlas repository and reviewed for their suitability based on specific usages common to research. BioMart, Google Data Sources, and caBIG are each suitable for certain uses. The tradeoffs made in the development of the technology are dependent on the uses each was designed for (e.g. security versus speed). This means that an understanding of specific requirements and tradeoffs is necessary before selecting the access technology.
BioMart; Google Data Sources; caBIG; data access; proteomics
The range of heterogeneous approaches available for quantifying protein abundance via mass spectrometry (MS) leads to considerable challenges in modeling, archiving, exchanging, or submitting experimental data sets as supplemental material to journals. To date, there has been no widely accepted format for capturing the evidence trail of how quantitative analysis has been performed by software, for transferring data between software packages, or for submitting to public databases. In the context of the Proteomics Standards Initiative, we have developed the mzQuantML data standard. The standard can represent quantitative data about regions in two-dimensional retention time versus mass/charge space (called features), peptides, and proteins and protein groups (where there is ambiguity regarding peptide-to-protein inference), and it offers limited support for small molecule (metabolomic) data. The format has structures for representing replicate MS runs, grouping of replicates (for example, as study variables), and capturing the parameters used by software packages to arrive at these values. The format has the capability to reference other standards such as mzML and mzIdentML, and thus the evidence trail for the MS workflow as a whole can now be described. Several software implementations are available, and we encourage other bioinformatics groups to use mzQuantML as an input, internal, or output format for quantitative software and for structuring local repositories. All project resources are available in the public domain from the HUPO Proteomics Standards Initiative http://www.psidev.info/mzquantml.
Controlled vocabularies (CVs), i.e. collections of predefined terms describing a modeling domain, and ontologies are used for the semantic annotation of data in structured data formats and databases: they avoid inconsistencies in annotation, provide a unique (and preferably short) accession number for each term, and give researchers and computer algorithms the possibility of more expressive semantic annotation of data. The Human Proteome Organization (HUPO)–Proteomics Standards Initiative (PSI) makes extensive use of ontologies/CVs in their data formats. The PSI-Mass Spectrometry (MS) CV contains all the terms used in the PSI MS–related data standards. The CV contains a logical hierarchical structure to ensure ease of maintenance and the development of software that makes use of complex semantics. The CV contains terms required for a complete description of an MS analysis pipeline used in proteomics, including sample labeling, digestion enzymes, instrumentation parts and parameters, software used for identification and quantification of peptides/proteins and the parameters and scores used to determine their significance. Owing to the range of topics covered by the CV, collaborative development across several PSI working groups, including proteomics research groups, instrument manufacturers and software vendors, was necessary. In this article, we describe the overall structure of the CV, the process by which it has been developed and is maintained and the dependencies on other ontologies.
Database URL: http://psidev.cvs.sourceforge.net/viewvc/psidev/psi/psi-ms/mzML/controlledVocabulary/psi-ms.obo
Policies supporting the rapid and open sharing of proteomic data are being implemented by the leading journals in the field. The proteomics community is taking steps to ensure that data are made publicly accessible and are of high quality, a challenging task that requires the development and deployment of methods for measuring and documenting data quality metrics. On September 18, 2010, the U.S. National Cancer Institute (NCI) convened the “International Workshop on Proteomic Data Quality Metrics” in Sydney, Australia, to identify and address issues facing the development and use of such methods for open access proteomics data. The stakeholders at the workshop enumerated the key principles underlying a framework for data quality assessment in mass spectrometry data that will meet the needs of the research community, journals, funding agencies, and data repositories. Attendees discussed and agreed upon two primary needs for the wide use of quality metrics: (1) an evolving list of comprehensive quality metrics and (2) standards accompanied by software analytics. Attendees stressed the importance of increased education and training programs to promote reliable protocols in proteomics. This workshop report explores the historic precedents, key discussions, and necessary next steps to enhance the quality of open access data.
By agreement, this article is published simultaneously in the Journal of Proteome Research, Molecular and Cellular Proteomics, Proteomics, and Proteomics Clinical Applications as a public service to the research community. The peer review process was a coordinated effort conducted by a panel of referees selected by the journals.
selected reaction monitoring; bioinformatics; data quality; metrics; open access; Amsterdam Principles; standards
For shotgun mass spectrometry-based proteomics, the most computationally expensive step is matching the spectra against an increasingly large database of sequences and their post-translational modifications with known masses. Each mass spectrometer can generate data at an astonishingly high rate, and the scope of what is searched for is continually increasing. Therefore, solutions for improving our ability to perform these searches are needed.
We present a sequence database search engine that is specifically designed to run efficiently on the Hadoop MapReduce distributed computing framework. The search engine implements the K-score algorithm, generating comparable output for the same input files as the original implementation. The scalability of the system is shown, and the architecture required for the development of such distributed processing is discussed.
The software is scalable in its ability to handle a large peptide database, numerous modifications and large numbers of spectra. Performance scales with the number of processors in the cluster, allowing throughput to expand with the available resources.
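The map/reduce split that such a distributed search engine relies on can be sketched in plain Python. Here a toy shared-peak-count score stands in for the real K-score algorithm, and `map_phase`/`reduce_phase` are illustrative names, not the project's API.

```python
def score(spectrum_peaks, peptide_peaks, tol=0.5):
    # Toy score: count spectrum peaks within tolerance of a theoretical peak.
    # (The actual K-score algorithm is considerably more involved.)
    return sum(1 for p in spectrum_peaks
               if any(abs(p - q) <= tol for q in peptide_peaks))

def map_phase(spectra, peptide_db):
    # Map: emit (spectrum_id, (peptide, score)) pairs; in Hadoop each
    # mapper would handle a shard of the spectra or of the database.
    for sid, peaks in spectra.items():
        for pep, theo in peptide_db.items():
            yield sid, (pep, score(peaks, theo))

def reduce_phase(pairs):
    # Reduce: keep the best-scoring peptide per spectrum key.
    best = {}
    for sid, (pep, s) in pairs:
        if sid not in best or s > best[sid][1]:
            best[sid] = (pep, s)
    return best
```

Because scoring each (spectrum, peptide) pair is independent, the map phase parallelizes trivially across a cluster, and the shuffle/reduce step only has to merge per-spectrum best hits.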
We report the release of mzIdentML, an exchange standard for peptide and protein identification data, designed by the Proteomics Standards Initiative. The format was developed by the Proteomics Standards Initiative in collaboration with instrument and software vendors, and the developers of the major open-source projects in proteomics. Software implementations have been developed to enable conversion from most popular proprietary and open-source formats, and mzIdentML will soon be supported by the major public repositories. These developments enable proteomics scientists to start working with the standard for exchanging and publishing data sets in support of publications and they provide a stable platform for bioinformatics groups and commercial software vendors to work with a single file format for identification data.
Targeted proteomics via selected reaction monitoring is a powerful mass spectrometric technique affording higher dynamic range, increased specificity and lower limits of detection than shotgun mass spectrometry methods when applied to proteome analyses. However, it involves selective measurement of predetermined analytes, which requires more preparation in the form of selecting appropriate signatures for the proteins and peptides that are to be targeted. There is a growing number of software programs and resources for selecting optimal transitions and the instrument settings used for the detection and quantification of the targeted peptides, but the exchange of this information is hindered by the lack of a standard format. We have developed a new standardized format, called TraML, for encoding transition lists and associated metadata. In addition to introducing the TraML format, we demonstrate several implementations across the community, and provide semantic validators, extensive documentation, and multiple example instances to demonstrate correctly written documents. Widespread use of TraML will facilitate the exchange of transitions, reduce time spent handling incompatible list formats, increase the reusability of previously optimized transitions, and thus accelerate the widespread adoption of targeted proteomics via selected reaction monitoring.
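What a transition list encodes can be illustrated with a short sketch: each transition pairs a precursor (Q1) m/z with a product (Q3) m/z for a given peptide. The residue masses and field names below are assumptions for illustration; a real TraML document additionally carries CV-annotated metadata such as collision energy and retention time.

```python
# Monoisotopic residue masses for a few amino acids (assumed values).
RESIDUE = {"G": 57.02146, "A": 71.03711, "S": 87.03203,
           "P": 97.05276, "K": 128.09496, "R": 156.10111}
WATER, PROTON = 18.01056, 1.00728

def precursor_mz(peptide, charge):
    # Peptide monoisotopic mass = residue masses + one water; then protonate.
    mass = sum(RESIDUE[aa] for aa in peptide) + WATER
    return (mass + charge * PROTON) / charge

def transition(peptide, charge, product_mz):
    # One row of a transition list: the Q1/Q3 pair a TraML document encodes.
    return {"peptide": peptide,
            "q1": round(precursor_mz(peptide, charge), 4),
            "q3": product_mz,
            "charge": charge}
```

For the (hypothetical) peptide "GASP" at charge 2, this yields a Q1 value of roughly 166.08 m/z.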
Honey bees are a mainstay of agriculture, contributing billions of dollars through their pollination activities. Bees have been a model system for sociality and group behavior for decades but only recently have molecular techniques been brought to bear on this fascinating and valuable organism. With the release of the first draft of its genome in 2006, proteomics of bees became feasible and over the past five years we have amassed in excess of five million MS/MS spectra. The lack of a consolidated platform to organize this massive resource hampers our ability, and that of others, to mine the information to its maximum potential.
Here we introduce the Honey Bee PeptideAtlas, a web-based resource for visualizing mass spectrometry data across experiments, providing protein descriptions and Gene Ontology annotations where possible. We anticipate that this will be helpful in planning proteomics experiments, especially in the selection of transitions for selected reaction monitoring. Through a proteogenomics effort, we have used MS/MS data to anchor the annotation of previously undescribed genes and to re-annotate previous gene models in order to improve the current genome annotation.
The Honey Bee PeptideAtlas will contribute to the efficiency of bee proteomics and accelerate our understanding of this species. This publicly accessible and interactive database is an important framework for the current and future analysis of mass spectrometry data.
PeptideAtlas is a multi-species compendium of peptides observed with tandem mass spectrometry methods. Raw mass spectrometer output files are collected from the community and reprocessed through a uniform analysis and validation pipeline that continues to advance. The results are loaded into a database and the information derived from the raw data is returned to the community via several web-based data exploration tools. The PeptideAtlas resource is useful for experiment planning, improving genome annotation, and other data mining projects. PeptideAtlas has become especially useful for planning targeted proteomics experiments.
proteomics; data repository; proteome; database; SRM
Mass spectrometry is an important technique for analyzing proteins and other biomolecular compounds in biological samples. Each of the vendors of these mass spectrometers uses a different proprietary binary output file format, which has hindered data sharing and the development of open source software for downstream analysis. The solution has been to develop, with the full participation of academic researchers as well as software and hardware vendors, an open XML-based format for encoding mass spectrometer output files, and then to write software to use this format for archiving, sharing, and processing. This chapter presents the various components and information available for this format, mzML. In addition to the XML schema that defines the file structure, a controlled vocabulary provides clear terms and definitions for the spectral metadata, and a semantic validation rules mapping file allows the mzML semantic validator to ensure that an mzML document complies with one of several levels of requirements. Complete documentation and example files ensure that the format may be uniformly implemented. At the time of release there already existed several implementations of the format and vendors have committed to supporting the format in their products.
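The flavor of the format can be conveyed with a minimal, schematically mzML-like fragment: spectral metadata is carried by cvParam elements referencing the PSI-MS controlled vocabulary. The fragment below is an illustration, not a schema-valid mzML document.

```python
import xml.etree.ElementTree as ET

# A schematically mzML-like fragment (illustrative, not schema-valid):
# each piece of metadata is a cvParam referencing a PSI-MS CV accession.
doc = """<spectrum id="scan=1" defaultArrayLength="3">
  <cvParam cvRef="MS" accession="MS:1000511" name="ms level" value="2"/>
  <cvParam cvRef="MS" accession="MS:1000504" name="base peak m/z" value="445.12"/>
</spectrum>"""

root = ET.fromstring(doc)
# Collect the CV-annotated metadata into a simple name -> value map.
params = {p.get("name"): p.get("value") for p in root.iter("cvParam")}
```

Because the term names and accessions come from a shared CV, a consumer can validate and interpret the metadata without vendor-specific logic.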
file format; mzML; standards; XML; controlled vocabulary
Electron transfer dissociation (ETD) is an alternative fragmentation technique to collision induced dissociation (CID) that has recently become commercially available. ETD has several advantages over CID. It is less prone to fragmenting amino acid side chains, especially those that are modified, thus yielding fragment ion spectra with more uniform peak intensities. Further, precursor ions of longer peptides and higher charge states can be fragmented and identified. However, the analysis of ETD spectra differs in a few important ways from that of CID data, requiring either the optimization of software packages developed for CID data or the development of specialized tools. We have adapted the Trans-Proteomic Pipeline (TPP) to process ETD data. Specifically, we have added support for fragment ion spectra from high charge precursors, compatibility with charge-state estimation algorithms, provisions for the use of the Lys-C protease, capabilities for ETD spectrum library building, and updates to the data formats to differentiate CID and ETD spectra. We show the results of processing datasets from several different types of ETD instruments and demonstrate that application of the ETD-enhanced TPP can increase the number of spectrum identifications at a fixed false discovery rate by as much as 100% over native output from a single sequence search engine.
shotgun proteomics; electron-transfer dissociation; bioinformatics
The Trans-Proteomic Pipeline (TPP) is a suite of software tools for the analysis of tandem mass spectrometry datasets. The tools encompass most of the steps in a proteomic data analysis workflow in a single, integrated software system. Specifically, the TPP supports all steps from spectrometer output file conversion to protein-level statistical validation, including quantification by stable isotope ratios. We describe here the full workflow of the TPP and the tools therein, along with an example on a sample dataset, demonstrating that the setup and use of the tools are straightforward and well supported and do not require specialized informatics resources or knowledge.
Multiple reaction monitoring mass spectrometry (MRM-MS) is a targeted analysis method that has been increasingly viewed as an avenue to explore proteomes with unprecedented sensitivity and throughput. We have developed a software tool, called MaRiMba, to automate the creation of explicitly defined MRM transition lists required to program triple quadrupole mass spectrometers in such analyses. MaRiMba creates MRM transition lists from downloaded or custom-built spectral libraries, restricts output to specified proteins or peptides, and filters based on precursor peptide and product ion properties. MaRiMba can also create MRM lists containing corresponding transitions for isotopically heavy peptides, for which the precursor and product ions are adjusted according to user specifications. This open-source application is operated through a graphical user interface incorporated into the Trans-Proteomic Pipeline, and it outputs the final MRM list to a text file for upload to MS instruments. To illustrate the use of MaRiMba, we used the tool to design and execute an MRM-MS experiment in which we targeted the proteins of a well-defined and previously published standard mixture.
multiple reaction monitoring (MRM); selected reaction monitoring (SRM); MRM transition; transition list; spectral library; mass spectrometry; targeted proteomics
Mass spectrometry is a fundamental tool for discovery and analysis in the life sciences. With the rapid advances in mass spectrometry technology and methods, it has become imperative to provide a standard output format for mass spectrometry data that will facilitate data sharing and analysis. Initially, the efforts to develop a standard format for mass spectrometry data resulted in multiple formats, each designed with a different underlying philosophy. To resolve the issues associated with having multiple formats, vendors, researchers, and software developers convened under the banner of the HUPO PSI to develop a single standard. The new data format incorporated many of the desirable technical attributes from the previous data formats, while adding a number of improvements, including features such as a controlled vocabulary with validation tools to ensure consistent usage of the format, improved support for selected reaction monitoring data, and immediately available implementations to facilitate rapid adoption by the community. The resulting standard data format, mzML, is a well tested open-source format for mass spectrometer output files that can be readily utilized by the community and easily adapted for incremental advances in mass spectrometry technology.
Systems biology conceptualizes biological systems as dynamic networks of interacting elements, whereby functionally important properties are thought to emerge from the structure of such networks. Due to the ubiquitous role of complexes of interacting proteins in biological systems, their subunit composition and temporal and spatial arrangement within the cell are of particular interest. ‘Visual proteomics’ attempts to localize individual macromolecular complexes inside of intact cells by template matching reference structures into cryo electron tomograms. Here we have combined quantitative mass spectrometry and cryo electron tomography to detect, count and localize specific protein complexes within the cytoplasm of the human pathogen Leptospira interrogans. We describe a novel scoring function for visual proteomics and assess its performance and accuracy under realistic conditions. We discuss current and general limitations of the approach, as well as expected improvements in the future.
Public proteomics databases such as PeptideAtlas contain peptides and proteins identified in mass spectrometry experiments. However, these databases lack information about human disease for researchers studying disease-related proteins. We have developed mspecLINE, a tool that combines knowledge about human disease in MEDLINE with empirical data about the detectable human proteome in PeptideAtlas. mspecLINE associates diseases with proteins by calculating the semantic distance between annotated terms from a controlled biomedical vocabulary. We used an established semantic distance measure that is based on the co-occurrence of disease and protein terms in the MEDLINE bibliographic database.
The mspecLINE web application allows researchers to explore relationships between human diseases and parts of the proteome that are detectable using a mass spectrometer. Given a disease, the tool will display proteins and peptides from PeptideAtlas that may be associated with the disease. It will also display relevant literature from MEDLINE. Furthermore, mspecLINE allows researchers to select proteotypic peptides for specific protein targets in a mass spectrometry assay.
Although mspecLINE applies an information retrieval technique to the MEDLINE database, it is distinct from previous MEDLINE query tools in that it combines the knowledge expressed in scientific literature with empirical proteomics data. The tool provides valuable information about candidate protein targets to researchers studying human disease and is freely available on a public web server.
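A co-occurrence-based semantic distance of this kind can be sketched as follows, in the style of the normalized Google distance over document counts. This is an illustration of the general technique, not necessarily the exact measure mspecLINE implements.

```python
import math

def semantic_distance(n_x, n_y, n_xy, n_total):
    """Co-occurrence distance between two terms over a corpus.

    n_x, n_y: number of documents mentioning term x / term y
    n_xy: number of documents mentioning both terms
    n_total: total number of documents in the corpus

    Returns 0 for terms that always co-occur, larger values for
    weakly related terms, and infinity if they never co-occur.
    """
    if n_xy == 0:
        return float("inf")
    log = math.log
    return ((max(log(n_x), log(n_y)) - log(n_xy)) /
            (log(n_total) - min(log(n_x), log(n_y))))
```

Applied to a MEDLINE-scale corpus, a small distance between a disease term and a protein term suggests that the two are frequently discussed together in the literature.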
Mass spectrometry based methods for relative proteome quantification have broadly impacted life science research. However, important research directions, particularly those involving mathematical modeling and simulation of biological processes, also critically depend on absolutely quantitative data, i.e. knowledge of the concentration of the expressed proteins as a function of cellular state. Until now, absolute protein concentration measurements of a significant fraction of the proteome (73%) have only been derived from genetically altered S. cerevisiae cells, a technique that is not directly portable from yeast to other species. In this study we developed and applied a mass spectrometry based strategy to determine the absolute quantity, i.e. the average number of protein copies per cell in a cell population, for a significant fraction of the proteome in genetically unperturbed cells. Applying the technology to the human pathogen Leptospira interrogans, a spirochete responsible for Leptospirosis, we generated an absolute protein abundance scale for 83% of the mass spectrometry detectable proteome, from cells in different states. Taking advantage of the unique cellular dimensions of L. interrogans, we used cryo electron tomography (cryoET) morphological measurements to verify at the single cell level the average absolute abundance values of selected proteins determined by mass spectrometry on a population of cells. As the strategy is relatively fast and applicable to any cell type we expect that it will become a cornerstone of quantitative biology and systems biology.
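The core of such an absolute abundance scale, calibrating observed MS intensities against a few proteins of known copy number, can be sketched as follows. This is a simplified illustration with hypothetical numbers; the calibration details of the published method differ.

```python
def copies_per_cell(intensities, anchors):
    """Scale MS intensities to absolute copies per cell.

    intensities: {protein: summed MS intensity} for the whole proteome
    anchors: {protein: known copies per cell} for a few calibrants
    Fits a single least-squares scale factor (through the origin) on
    the anchor proteins and applies it to every measured protein.
    """
    num = sum(anchors[p] * intensities[p] for p in anchors)
    den = sum(intensities[p] ** 2 for p in anchors)
    k = num / den
    return {p: k * i for p, i in intensities.items()}
```

Given two anchors whose known abundances are exactly proportional to their intensities, every other protein's estimate is simply its intensity times the fitted scale factor.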
We carried out a test sample study to try to identify errors leading to irreproducibility, including incompleteness of peptide sampling, in LC-MS-based proteomics. We distributed a test sample consisting of an equimolar mix of 20 highly purified recombinant human proteins to 27 laboratories for identification. Each protein contained one or more unique tryptic peptides of 1250 Da to also test for ion selection and sampling in the mass spectrometer. Of the 27 labs, initially only 7 reported all 20 proteins correctly, and only 1 reported all the tryptic peptides of 1250 Da. Nevertheless, a subsequent centralized analysis of the raw data revealed that all 20 proteins and most of the 1250 Da peptides had in fact been detected by all 27 labs. The centralized analysis allowed us to determine sources of problems encountered in the study, which include missed identifications (false negatives), environmental contamination, database matching, and curation of protein identifications. Improved search engines and databases are likely to increase the fidelity of mass spectrometry-based proteomics.
The Minimum Information for Biological and Biomedical Investigations (MIBBI) project provides a resource for those exploring the range of extant minimum information checklists and fosters coordinated development of such checklists.