We present a novel mass spectrometry-based high-throughput workflow and an open-source computational and data resource to reproducibly identify and quantify HLA-associated peptides. Collectively, the resources support the generation of HLA allele-specific peptide assay libraries consisting of consensus fragment ion spectra, and the analysis of quantitative digital maps of HLA peptidomes generated from a range of biological sources by SWATH mass spectrometry (MS). This study represents the first community-based effort to develop a robust platform for the reproducible and quantitative measurement of the entire repertoire of peptides presented by HLA molecules, an essential step towards the design of efficient immunotherapies.
The cells of the immune system protect us by recognizing telltale molecules produced by damaged and diseased cells, or by infection-causing microorganisms (which are also called pathogens). To help with this process, the cells in our bodies display small fragments of proteins (called peptides) on their surface that are then checked by the immune cells. Collectively, these peptides are referred to as the ‘immunopeptidome’, and deciphering the complexity of the human immunopeptidome is important for both basic research and medical science. Such an achievement would help to guide the development of next-generation vaccines and therapies against autoimmune disorders, infectious diseases and cancers.
In the past, immune peptides were mostly identified using a technique that is commonly called ‘shotgun’ mass spectrometry. However, this approach does not always provide reproducible results. In 2012, researchers reported the development of a new approach—which they called ‘SWATH’ mass spectrometry—that could yield more reproducible data.
Now, Caron et al.—including many of the researchers involved in the 2012 study—have developed a large collection of standardized tests that use SWATH mass spectrometry to analyze the human immunopeptidome. The workflow and the computational and data resources developed as part of this international effort are the first steps toward highly reproducible and quantitative analyses of the immunopeptidome across many samples. Moreover, the large repository of assays generated by the project has been made public and will serve a large community of researchers, which should enable better collaborations.
In the future, SWATH mass spectrometry could be used as a robust technology for the reproducible detection and measurement of pathogen-specific or cancer-specific immune peptides. This could greatly help in the design of personalized immune-based therapies.
human leukocyte antigen; immunopeptidome; targeted mass spectrometry; SWATH-MS; DIA; human
Public repositories for proteomics data have accelerated proteomics research by enabling more efficient cross-analyses of datasets, supporting the creation of protein and peptide compendia of experimental results, supporting the development and testing of new software tools, and facilitating the manuscript review process. The repositories available to date have been designed to accommodate either shotgun experiments or generic proteomic data files. Here, we describe a new kind of proteomic data repository for the collection and representation of data from selected reaction monitoring (SRM) measurements. The PeptideAtlas SRM Experiment Library (PASSEL) allows researchers to easily submit proteomic data sets generated by SRM. The raw data are automatically processed in a uniform manner and the results are stored in a database, where they may be downloaded or browsed via a web interface that includes a chromatogram viewer. PASSEL enables cross-analysis of SRM data, supports optimization of SRM data collection, and facilitates the review process of SRM data. Further, PASSEL will help in the assessment of proteotypic peptide performance in a wide array of samples containing the same peptide, as well as across multiple experimental protocols.
data repository; MRM; software; SRM; targeted proteomics
The application of mass spectrometry (MS) to the analysis of proteomes has enabled the high-throughput identification and abundance measurement of hundreds to thousands of proteins per experiment. However, the formidable informatics challenge associated with analyzing MS data has required a wide variety of data file formats to encode the complex data types associated with MS workflows. These formats encompass the encoding of input instruction for instruments, output products of the instruments, and several levels of information and results used by and produced by the informatics analysis tools. A brief overview of the most common file formats in use today is presented here, along with a discussion of related topics.
PeptideAtlas, SRMAtlas and PASSEL are web-accessible resources to support discovery and targeted proteomics research. PeptideAtlas is a multi-species compendium of shotgun proteomic data provided by the scientific community, SRMAtlas is a resource of high-quality, complete proteome SRM assays generated in a consistent manner for the targeted identification and quantification of proteins, and PASSEL is a repository that compiles and represents selected reaction monitoring data, all in an easy-to-use interface. The databases are generated from native mass spectrometry data files that are analyzed in a standardized manner, including statistical validation of the results. Each resource offers search functionalities and can be queried by user-defined constraints; the query results are provided in tables or are graphically displayed. PeptideAtlas, SRMAtlas and PASSEL are freely available via the website http://www.peptideatlas.org. In this protocol, we describe the use of these resources and highlight how to submit, search, collate and download data.
discovery proteomics; targeted proteomics; selected reaction monitoring (SRM); data repository; data resource; complete proteome library
Targeted proteomics via selected reaction monitoring is a powerful mass spectrometric technique affording higher dynamic range, increased specificity and lower limits of detection than shotgun mass spectrometry methods when applied to proteome analyses. However, it involves selective measurement of predetermined analytes, which requires more preparation in the form of selecting appropriate signatures for the proteins and peptides that are to be targeted. There is a growing number of software programs and resources for selecting optimal transitions and the instrument settings used for the detection and quantification of the targeted peptides, but the exchange of this information is hindered by the lack of a standard format. We have developed a new standardized format, called TraML, for encoding transition lists and associated metadata. In addition to introducing the TraML format, we demonstrate several implementations across the community, and provide semantic validators, extensive documentation, and multiple example instances to demonstrate correctly written documents. Widespread use of TraML will facilitate the exchange of transitions, reduce time spent handling incompatible list formats, increase the reusability of previously optimized transitions, and thus accelerate the widespread adoption of targeted proteomics via selected reaction monitoring.
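The core content of a TraML transition list can be sketched in a few lines of code. This is a minimal sketch of what one transition entry encodes, not the actual TraML XML schema; the peptide, m/z values and collision energy below are illustrative placeholders.

```python
from dataclasses import dataclass

@dataclass
class Transition:
    """One SRM transition: a precursor/product m/z pair for a target peptide."""
    peptide_sequence: str
    precursor_mz: float    # m/z of the selected precursor (peptide) ion
    product_mz: float      # m/z of the monitored fragment ion
    collision_energy: float

def validate(t: Transition) -> bool:
    """Basic sanity checks a transition-list validator might apply."""
    return (len(t.peptide_sequence) > 0
            and t.precursor_mz > 0
            and t.product_mz > 0)

# Two transitions monitoring the same (hypothetical) precursor.
transitions = [
    Transition("ANELLINVK", 499.8, 815.5, 24.0),
    Transition("ANELLINVK", 499.8, 702.4, 24.0),
]
assert all(validate(t) for t in transitions)
```

A standard format such as TraML fixes the names, units and controlled-vocabulary terms for exactly this kind of record, so that transition lists can move between vendor software without manual conversion.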
One purpose of the biomedical literature is to report results in sufficient detail so that the methods of data collection and analysis can be independently replicated and verified. Here we present for consideration a minimum information specification for gene expression localization experiments, called the “Minimum Information Specification For In Situ Hybridization and Immunohistochemistry Experiments (MISFISHIE)”. It is modelled after the MIAME (Minimum Information About a Microarray Experiment) specification for microarray experiments. Data specifications like MIAME and MISFISHIE specify the information content without dictating a format for encoding that information. The MISFISHIE specification describes six types of information that should be provided for each experiment: Experimental Design, Biomaterials and Treatments, Reporters, Staining, Imaging Data, and Image Characterizations. This specification has benefited the consortium within which it was initially developed and is expected to benefit the wider research community. We welcome feedback from the scientific community to help improve our proposal.
Data standardization; Human Proteome Organisation; Proteomics Standards Initiative
Mass spectrometry has long offered the potential of discovering biomarkers that would enable clinicians to diagnose disease and treat it with targeted therapies. Hundreds of human samples have been used to generate thousands of spectra for identification. These data, and the generation of targeted peptide information, represent the first step in the process of locating disease biomarkers. Reaching the goal of clinical proteomics requires that these data be integrated with additional information from disease literature and genomic studies. Here we describe PeptideAtlas and associated methods for mining the data, as well as the software tools necessary to support large-scale integration and mining.
SRM; Mass spectrometry; proteomic; visualization; data mining
Progress in mass spectrometry-based methods for veterinary research and diagnostics is lagging behind human research, and proteome data of domestic animals are still not well represented in open-source data repositories. This is particularly true for the equine species. Here we present a first Equine PeptideAtlas encompassing high-resolution tandem mass spectrometry analyses of 51 samples representing a selection of equine tissues and body fluids from healthy and diseased animals. The raw data were processed through the Trans-Proteomic Pipeline to yield high-quality identifications of proteins and peptides. The current release comprises 24,131 distinct peptides representing 2636 canonical proteins observed at false discovery rates of 0.2% at the peptide level and 1.4% at the protein level. Data from the Equine PeptideAtlas are available for experimental planning, validation of new datasets, and as a proteomic data mining resource. The advantages of the Equine PeptideAtlas are demonstrated by examples of mining the contents for information on potential and well-known equine acute phase proteins, which are of extensive general interest in the veterinary clinic. The extracted information will support further analyses, and emphasises the value of the Equine PeptideAtlas as a resource for the design of targeted quantitative proteomic studies.
Acute phase proteins; Animal proteomics; Equine; PeptideAtlas; Proteotypic peptides
One year ago the Human Proteome Project (HPP) leadership designated the baseline metrics for the Human Proteome Project to be based upon neXtProt with a total of 13 664 proteins validated at protein evidence level 1 (PE1) by mass spectrometry, antibody-capture, Edman sequencing, or 3D structures. Corresponding chromosome-specific data were provided from PeptideAtlas, GPMdb, and Human Protein Atlas. This year the neXtProt total is 15 646 and the other resources, which are inputs to neXtProt, have high quality identifications and additional annotations for 14 012 in PeptideAtlas, 14 869 in GPMdb, and 10 976 in HPA. We propose to remove 638 genes from the denominator that are “uncertain” or “dubious” in Ensembl, UniProt/SwissProt, and neXtProt. That leaves 3844 “missing proteins”, currently having no or inadequate documentation, to be found from a new denominator of 19 490 protein-coding genes. We present those tabulations and weblinks and discuss current strategies to find the missing proteins.
Human Proteome Project; neXtProt; PeptideAtlas; GPMdb; Human Protein Atlas; metrics; missing proteins
The kidney, urine, and plasma proteomes are intimately related: proteins and metabolic waste products are filtered from the plasma by the kidney and excreted via the urine, while kidney proteins may be secreted into the circulation or released into the urine. Shotgun proteomics datasets derived from human kidney, urine, and plasma samples were collated and processed using a uniform software pipeline, and relative protein abundances were estimated by spectral counting. The resulting PeptideAtlas builds yielded 4005, 2491, and 3553 nonredundant proteins at 1% FDR for the kidney, urine, and plasma proteomes, respectively—for kidney and plasma, the largest high-confidence protein sets to date. The same pipeline applied to all available human data yielded a 2013 Human PeptideAtlas build containing 12,644 nonredundant proteins and at least one peptide for each of ~14,000 Swiss-Prot entries, an increase over 2012 of ~7.5% of the predicted human proteome. We demonstrate that abundances are correlated between plasma and urine, examine the most abundant urine proteins not derived from either plasma or kidney, and consider the biomarker potential of proteins associated with renal decline. This analysis forms part of the Biology and Disease-driven Human Proteome Project (B/D-HPP) and a contribution to the Chromosome-centric Human Proteome Project (C-HPP) special issue.
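The spectral-counting estimate of relative protein abundance used above can be sketched as follows. NSAF (normalized spectral abundance factor) is one common variant; the counts and protein lengths here are invented illustration values, not data from the atlas builds.

```python
# NSAF_i = (SpC_i / L_i) / sum_j (SpC_j / L_j), where SpC is the number of
# spectra matched to protein i and L its length in residues; dividing by
# length corrects for longer proteins yielding more detectable peptides.

def nsaf(spectral_counts, lengths):
    """Return {protein: NSAF} relative abundance estimates."""
    saf = {p: spectral_counts[p] / lengths[p] for p in spectral_counts}
    total = sum(saf.values())
    return {p: v / total for p, v in saf.items()}

# Illustrative counts for two proteins (not real atlas data).
abund = nsaf({"ALB": 120, "UMOD": 40}, {"ALB": 609, "UMOD": 640})
assert abs(sum(abund.values()) - 1.0) < 1e-9   # NSAF values sum to 1
assert abund["ALB"] > abund["UMOD"]
```

Because NSAF values are normalized within each sample, they support the kind of cross-compartment abundance correlations (plasma versus urine) examined in the study.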
Human Proteome Project; PeptideAtlas; LC-MS/MS; database; kidney; plasma; urine; proteome comparison
Mass spectrometry is the method of choice for deep and reliable exploration of the (human) proteome. Targeted mass spectrometry reliably detects and quantifies pre-determined sets of proteins in a complex biological matrix and is used in studies that rely on the quantitatively accurate and reproducible measurement of proteins across multiple samples. It requires the one-time, a priori generation of a specific measurement assay for each targeted protein. SWATH-MS is a mass spectrometric method that combines data-independent acquisition (DIA) and targeted data analysis and vastly extends the throughput of proteins that can be targeted in a sample compared to selected reaction monitoring (SRM). Here we present a compendium of highly specific assays covering more than 10,000 human proteins and enabling their targeted analysis in SWATH-MS datasets acquired from research or clinical specimens. This resource supports the confident detection and quantification of 50.9% of all human proteins annotated by UniProtKB/Swiss-Prot and is therefore expected to find wide application in basic and clinical research. Data are available via ProteomeXchange (PXD000953-954) and SWATHAtlas (SAL00016-35).
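The targeted analysis of DIA data described above rests on matching the fragment ions of a library assay against acquired spectra. This is a minimal sketch with a deliberately simplified score (the count of library fragments found within a ppm tolerance); real tools such as those used with SWATH-MS also score retention time, peak shape and intensity correlation. All m/z values are illustrative.

```python
# Match library fragment m/z values against peaks in a DIA spectrum.

def matched_fragments(library_mz, spectrum_mz, ppm_tol=20.0):
    """Return the library fragment m/z values found in the spectrum
    within +/- ppm_tol parts per million."""
    matched = []
    for frag in library_mz:
        tol = frag * ppm_tol / 1e6
        if any(abs(frag - peak) <= tol for peak in spectrum_mz):
            matched.append(frag)
    return matched

library = [175.119, 401.288, 514.372, 643.415]   # assay fragment ions
spectrum = [175.120, 514.370, 900.5]             # acquired DIA peaks
hits = matched_fragments(library, spectrum)
assert hits == [175.119, 514.372]  # 2 of 4 fragments matched
```

The assay library's role is to supply, for each of the >10,000 proteins, exactly this kind of fragment coordinate list against which the multiplexed DIA spectra are queried.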
To facilitate sharing of Omics data, many groups of scientists have been working to establish the relevant data standards. The main components of data sharing standards are experiment description standards, data exchange standards, terminology standards, and experiment execution standards. Here we provide a survey of existing and emerging standards that are intended to assist the free and open exchange of large-format data.
Data sharing; Data exchange; Data standards; MGED; MIAME; Ontology; Data format; Microarray; Proteomics; Metabolomics
We report progress assembling the parts list for chromosome 17 and illustrate the various processes that we have developed to integrate available data from diverse genomic and proteomic knowledge bases. As primary resources we have used GPMDB, neXtProt, PeptideAtlas, Human Protein Atlas (HPA), and GeneCards. All sites share the common resource of Ensembl for the genome modeling information. We have defined the chromosome 17 parts list with the following information: 1169 protein-coding genes, the numbers of proteins confidently identified by various experimental approaches as documented in GPMDB, neXtProt, PeptideAtlas, and HPA, examples of typical data sets obtained by RNASeq and proteomic studies of epithelial-derived tumor cell lines (disease proteome) and a normal proteome (peripheral mononuclear cells), reported evidence of post-translational modifications, and examples of alternative splice variants (ASVs). We have constructed a list of the 59 ‘missing’ proteins as well as 201 proteins that have inconclusive mass spectrometric (MS) identifications. In this report we have defined a process to establish a baseline for the incorporation of new evidence on protein identification and characterization, as well as related information from transcriptome analyses. This initial list of ‘missing’ proteins will guide the selection of appropriate samples for discovery studies as well as antibody reagents. We have also illustrated the significant diversity of protein variants (including post-translational modifications, PTMs) using regions on chromosome 17 that contain important oncogenes. We emphasize the need for mandated deposition of proteomics data in public databases, the further development of improved PTM, ASV and single nucleotide variant (SNV) databases, and the construction of websites that can integrate and regularly update such information. In addition, we describe the distribution of both clustered and scattered sets of protein families on the chromosome.
Since chromosome 17 is rich in cancer-associated genes, we have focused on the clustering of these genes in particular genomic regions and have used the ERBB2 amplicon as an example of the value of a proteogenomic approach, in which one integrates transcriptomic with proteomic information and captures evidence of co-expression through coordinated regulation.
Chromosome-Centric Human Proteome Project; Chromosome 17 Parts List; ERBB2; Oncogene
Complete reference maps or datasets, like the genomic map of an organism, are highly beneficial tools for biological and biomedical research. Attempts to generate such reference datasets for a proteome have so far failed to reach complete proteome coverage, with saturation apparent at approximately two thirds of the proteomes tested, even for the most thoroughly characterized proteomes. Here, we used a strategy based on high-throughput peptide synthesis and mass spectrometry to generate a close-to-complete reference map (97% of the genome-predicted proteins) of the S. cerevisiae proteome. We generated two versions of this mass spectrometric map: one supporting discovery (shotgun) and the other hypothesis-driven (targeted) proteomic measurements. The two versions of the map, therefore, constitute a complete set of proteomic assays to support most studies performed with contemporary proteomic technologies. The reference libraries can be browsed via a web-based repository and associated navigation tools. To demonstrate the utility of the reference libraries, we applied them to a protein quantitative trait locus (pQTL) analysis, which requires measurement of the same peptides over a large number of samples with high precision. Protein measurements over a set of 78 S. cerevisiae strains revealed a complex relationship between independent genetic loci that impact the levels of related proteins. Our results suggest that selective pressure favors the acquisition of sets of polymorphisms that maintain the stoichiometry of protein complexes and pathways.
S. cerevisiae; selected reaction monitoring; SRM; MRM; spectral library; peptide library; mass spectrometric map; protein QTL
The Human Proteome Project was launched in September 2010 with the goal of characterizing at least one protein product from each protein-coding gene. Here we assess how much of the proteome has been detected to date via tandem mass spectrometry by analyzing PeptideAtlas, a compendium of human derived LC-MS/MS proteomics data from many laboratories around the world. All datasets are processed with a consistent set of parameters using the Trans-Proteomic Pipeline and subjected to a 1% protein FDR filter before inclusion in PeptideAtlas. Therefore, PeptideAtlas contains only high confidence protein identifications. To increase proteome coverage, we explored new comprehensive public data sources for data likely to add new proteins to the Human PeptideAtlas. We then folded these data into a Human PeptideAtlas 2012 build and mapped it to Swiss-Prot, a protein sequence database curated to contain one entry per human protein coding gene. We find that this latest PeptideAtlas build includes at least one peptide for each of ~12,500 Swiss-Prot entries, leaving ~7500 gene products yet to be confidently cataloged. We characterize these “PA-unseen” proteins in terms of tissue localization, transcript abundance, and Gene Ontology enrichment, and propose reasons for their absence from PeptideAtlas and strategies for detecting them in the future.
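The 1% FDR filter applied before inclusion in PeptideAtlas is conventionally estimated by target-decoy competition: hits to a reversed or shuffled ("decoy") database estimate the rate of false hits among target identifications above a score threshold. This is a minimal sketch with invented scores, not the actual Trans-Proteomic Pipeline model (which uses mixture-model posterior probabilities).

```python
# Target-decoy FDR: estimated FDR = decoys / targets above the threshold.

def fdr_at_threshold(hits, threshold):
    """hits: list of (score, is_decoy) pairs. Return the estimated FDR
    among target hits with score >= threshold."""
    kept = [(s, d) for s, d in hits if s >= threshold]
    decoys = sum(1 for _, d in kept if d)
    targets = len(kept) - decoys
    return decoys / targets if targets else 0.0

# Invented search results: four target hits, one decoy hit.
hits = [(9.1, False), (8.7, False), (8.2, False), (7.9, True), (7.5, False)]
assert fdr_at_threshold(hits, 8.0) == 0.0          # no decoys survive
assert abs(fdr_at_threshold(hits, 7.0) - 0.25) < 1e-9  # 1 decoy / 4 targets
```

In practice the threshold is chosen as the lowest score at which this estimate stays below the desired level (here, 1% at the protein level).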
Human Proteome Project; PeptideAtlas; LC-MS/MS; database; protein inference
The Human Proteome Organisation Proteomics Standards Initiative (HUPO-PSI) was established in 2002 with the aim of defining community standards for data representation in proteomics and facilitating data comparison, exchange and verification. Over the last 10 years significant advances have been made, with common data standards now published and implemented in the field of both mass spectrometry and molecular interactions. The 2012 meeting further advanced this work, with the mass spectrometry groups finalising approaches to capturing the output from recent developments in the field, such as quantitative proteomics and SRM. The molecular interaction group focused on improving the integration of data from multiple resources. Both groups united with a guest work track, organized by the HUPO Technology/Standards Committee, to formulate proposals for data submissions from the HUPO Human Proteome Project and to start an initiative to collect standard experimental protocols.
Adoption of targeted mass spectrometry (MS) approaches such as multiple reaction monitoring (MRM) to study biological and biomedical questions is well underway in the proteomics community. Successful application depends on the ability to generate reliable assays that uniquely and confidently identify target peptides in a sample. Unfortunately, there is a wide range of criteria being applied to claim that an assay has been successfully developed. There is no consensus on what criteria are acceptable and little understanding of the impact of variable criteria on the quality of the results generated. Publications describing targeted MS assays for peptides frequently do not contain sufficient information for readers to establish confidence that the assays work as intended or to be able to apply the assays described in their own labs. Guidance must be developed so that targeted MS assays with established performance can be widely distributed and applied by many labs worldwide. To begin to address these problems and their solutions, a workshop was held at the National Institutes of Health with representatives from the multiple communities developing and employing targeted MS assays. Participants discussed the analytical goals of their experiments and the experimental evidence needed to establish that the assays they develop work as intended and achieve the required levels of performance. Using this “fit-for-purpose” approach, the group defined three tiers of assays distinguished by their performance and extent of analytical characterization. Computational and statistical tools useful for the analysis of targeted MS results were described. Participants also detailed the information that authors need to provide in their manuscripts to enable reviewers and readers to clearly understand what procedures were performed and to evaluate the reliability of the peptide or protein quantification measurements reported.
This paper presents a summary of the meeting and recommendations.
This paper focuses on the use of controlled vocabularies (CVs) and ontologies, especially in the area of proteomics, primarily in relation to the work of the Proteomics Standards Initiative (PSI). It describes the relevant proteomics standard formats and the ontologies used within them. Software and tools for working with these ontology files are also discussed. The article also examines the “mapping files” used to ensure that correct controlled vocabulary terms are placed within PSI standards and that the MIAPE (Minimum Information about a Proteomics Experiment) requirements are fulfilled. This article is part of a Special Issue entitled: Computational Proteomics in the Post-Identification Era. Guest Editors: Martin Eisenacher and Christian Stephan.
Highlights: The semantic annotation using ontologies is a prerequisite for the semantic web. The HUPO-PSI defined a set of XML-based standard formats for proteomics. These standard formats allow the referencing of CV terms defined in OBO files. The CV terms can be used to enforce MIAPE compliance of the data files. The mass spectrometry CV is constantly maintained in a community process.
ANDI-MS, Analytical Data Interchange format for Mass Spectrometry; AniML, Analytical Information Markup Language; API, Application Programming Interface; ASCII, American Standard Code for Information Interchange; ASTM, American Society for Testing and Materials; BTO, BRENDA (BRaunschweig ENzyme DAtabase) Tissue Ontology; ChEBI, Chemical Entities of Biological Interest; CV, Controlled Vocabulary; DL, Description Logic; EBI, European Bioinformatics Institute; HDF5, Hierarchical Data Format, version 5; HUPO-PSI, Human Proteome Organisation-Proteomics Standards Initiative; ICD, International Classification of Diseases; IUPAC, International Union for Pure and Applied Chemistry; JCAMP-DX, Joint Committee on Atomic and Molecular Physical data-Data eXchange format; MALDI, Matrix Assisted Laser Desorption Ionization; MeSH, Medical Subject Headings; MI, Molecular Interaction; MIBBI, Minimal Information for Biological and Biomedical Investigations; MITAB, Molecular Interactions TABular format; MIAPE, Minimum Information About a Proteomics Experiment; MS, Mass Spectrometry; NCBI, National Center for Biotechnology Information; NCBO, National Center for Biomedical Ontology; netCDF, Network Common Data Format; OBI, Ontology for Biomedical Investigations; OBO, Open Biological and Biomedical Ontologies; OLS, Ontology Lookup Service; OWL, Web Ontology Language; PAR, Protein Affinity Reagents; PATO, Phenotype Attribute Trait Ontology; PRIDE, PRoteomics IDEntifications database; RDF(S), Resource Description Framework (Schema); SRM, Selected Reaction Monitoring; TPP, Trans-Proteomic Pipeline; URI, Uniform Resource Identifier; XSLT, eXtensible Stylesheet Language Transformation; YAFMS, Yet Another Format for Mass Spectrometry; Proteomics data standards; Controlled vocabularies; Ontologies in proteomics; Ontology formats; Ontology editors and software; Ontology maintenance
The rigorous testing of hypotheses on suitable sample cohorts is a major limitation in translational research. This is particularly the case for the validation of protein biomarkers where the lack of accurate, reproducible and sensitive assays for most proteins has precluded the systematic assessment of hundreds of potential marker proteins described in the literature.
Here, we describe a high throughput method for the development and refinement of selected reaction monitoring (SRM) assays for human proteins. The method was applied to generate such assays for more than 1000 cancer-associated proteins, which are functionally related to candidate cancer driver mutations. We used the assays to determine the detectability of the target proteins in two clinically relevant samples, plasma and urine. 182 proteins were detected in depleted plasma, spanning five orders of magnitude in abundance and reaching below a concentration of 10 ng/mL. The narrower concentration range of proteins in urine allowed the detection of 408 proteins. Moreover, we demonstrate that these SRM assays allow the reproducible quantification of 34 biomarker candidates across 84 patient plasma samples. Through public access to the entire assay library, which will also be expandable in the future, researchers will be able to target their cancer-associated proteins of interest in any sample type using the detectability information in plasma and urine as a guide. The generated reference map of SRM assays for cancer-associated proteins is a valuable resource for accelerating and planning biomarker verification studies.
Access to public data sets is important to the scientific community as a resource to develop new experiments or validate new data. Projects such as the PeptideAtlas, Ensembl and The Cancer Genome Atlas (TCGA) offer both access to public data and a repository to share their own data. Access to these data sets is often provided through a web page form and a web service API. Access technologies based on web protocols (e.g. http) have been in use for over a decade and are widely adopted across the industry for a variety of functions (e.g. search, commercial transactions, and social media). Each architecture adapts these technologies to provide users with tools to access and share data. Both commonly used web service technologies (e.g. REST and SOAP), and custom-built solutions over HTTP are utilized in providing access to research data. Providing multiple access points ensures that the community can access the data in the simplest and most effective manner for their particular needs. This article examines three common access mechanisms for web accessible data: BioMart, caBIG, and Google Data Sources. These are illustrated by implementing each over the PeptideAtlas repository and reviewed for their suitability based on specific usages common to research. BioMart, Google Data Sources, and caBIG are each suitable for certain uses. The tradeoffs made in the development of the technology are dependent on the uses each was designed for (e.g. security versus speed). This means that an understanding of specific requirements and tradeoffs is necessary before selecting the access technology.
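The web-protocol access patterns compared above reduce, at their simplest, to constructing parameterized query URLs against an HTTP endpoint. This is a minimal sketch of that step; the endpoint path and parameter names are hypothetical placeholders, not the actual PeptideAtlas, BioMart or caBIG API.

```python
from urllib.parse import urlencode

def build_query_url(base, **params):
    """Build a REST-style GET query URL with sorted, percent-encoded
    parameters (sorting keeps the URL deterministic and cache-friendly)."""
    return base + "?" + urlencode(sorted(params.items()))

url = build_query_url(
    "https://example.org/repository/search",  # placeholder endpoint
    organism="human",
    peptide="ANELLINVK",
)
assert url == ("https://example.org/repository/search"
               "?organism=human&peptide=ANELLINVK")
```

The trade-offs discussed in the article sit above this layer: REST favors simple URLs like the one built here, whereas SOAP and caBIG wrap the same request in typed XML envelopes that add security and contract guarantees at the cost of speed and simplicity.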
BioMart; Google Data Sources; caBIG; data access; proteomics
The range of heterogeneous approaches available for quantifying protein abundance via mass spectrometry (MS) leads to considerable challenges in modeling, archiving, exchanging, or submitting experimental data sets as supplemental material to journals. To date, there has been no widely accepted format for capturing the evidence trail of how quantitative analysis has been performed by software, for transferring data between software packages, or for submitting to public databases. In the context of the Proteomics Standards Initiative, we have developed the mzQuantML data standard. The standard can represent quantitative data about regions in two-dimensional retention time versus mass/charge space (called features), peptides, and proteins and protein groups (where there is ambiguity regarding peptide-to-protein inference), and it offers limited support for small molecule (metabolomic) data. The format has structures for representing replicate MS runs, grouping of replicates (for example, as study variables), and capturing the parameters used by software packages to arrive at these values. The format has the capability to reference other standards such as mzML and mzIdentML, and thus the evidence trail for the MS workflow as a whole can now be described. Several software implementations are available, and we encourage other bioinformatics groups to use mzQuantML as an input, internal, or output format for quantitative software and for structuring local repositories. All project resources are available in the public domain from the HUPO Proteomics Standards Initiative http://www.psidev.info/mzquantml.
Mass-spectrometry-based proteomics has become an important component of biological research. Numerous proteomics methods have been developed to identify and quantify the proteins in biological and clinical samples, identify pathways affected by endogenous and exogenous perturbations, and characterize protein complexes. Despite these successes, the interpretation of vast proteomics datasets remains a challenge. There have been several calls for improvements and standardization of proteomics data analysis frameworks, as well as for an application programming interface for proteomics data access. In response, we have developed the ProteoWizard Toolkit, a robust set of open-source software libraries and applications designed to facilitate proteomics research. The libraries implement the first non-commercial, unified data access interface for proteomics, bridging field-standard open formats and all common vendor formats. In addition, diverse software classes enable rapid development of vendor-agnostic proteomics software. Moreover, ProteoWizard projects and applications, building upon the core libraries, are becoming standard tools for enabling significant proteomics inquiries.
Controlled vocabularies (CVs), i.e. collections of predefined terms describing a modeling domain, and ontologies are used in structured data formats and databases for the semantic annotation of data: they help avoid inconsistencies in annotation, provide unique (and preferably short) accession numbers, and give researchers and computer algorithms the possibility of more expressive semantic annotation of data. The Human Proteome Organization (HUPO)–Proteomics Standards Initiative (PSI) makes extensive use of ontologies/CVs in its data formats. The PSI-Mass Spectrometry (MS) CV contains all the terms used in the PSI MS-related data standards. The CV has a logical hierarchical structure to ensure ease of maintenance and to support the development of software that makes use of complex semantics. It contains the terms required for a complete description of an MS analysis pipeline used in proteomics, including sample labeling, digestion enzymes, instrumentation parts and parameters, software used for identification and quantification of peptides/proteins, and the parameters and scores used to determine their significance. Owing to the range of topics covered by the CV, collaborative development across several PSI working groups, including proteomics research groups, instrument manufacturers and software vendors, was necessary. In this article, we describe the overall structure of the CV, the process by which it has been developed and is maintained, and the dependencies on other ontologies.
Database URL: http://psidev.cvs.sourceforge.net/viewvc/psidev/psi/psi-ms/mzML/controlledVocabulary/psi-ms.obo
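The psi-ms.obo file referenced above uses the OBO flat-file format, whose [Term] stanzas of tag-value lines are simple to parse. A minimal sketch follows; the two terms are simplified examples modeled on CV entries, not verbatim copies, and the parser handles only the basic tags needed here.

```python
# Parse [Term] stanzas from an OBO-format controlled vocabulary.

SAMPLE_OBO = """\
[Term]
id: MS:1000031
name: instrument model

[Term]
id: MS:1000041
name: charge state
"""

def parse_obo_terms(text):
    """Return {term id: {tag: value}} for each [Term] stanza."""
    terms = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line == "[Term]":
            current = {}
        elif current is not None and ": " in line:
            tag, value = line.split(": ", 1)
            current[tag] = value
            if tag == "name":           # id always precedes name in a stanza
                terms[current["id"]] = current
    return terms

terms = parse_obo_terms(SAMPLE_OBO)
assert len(terms) == 2
assert terms["MS:1000031"]["name"] == "instrument model"
```

The hierarchical structure the abstract describes is carried by additional tags (notably is_a relationships between term accessions), which a fuller parser would collect the same way.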