The Human Proteome Organisation Proteomics Standards Initiative (HUPO-PSI) was established in 2002 with the aim of defining community standards for data representation in proteomics and facilitating data comparison, exchange and verification. Over the last 10 years significant advances have been made, with common data standards now published and implemented in the field of both mass spectrometry and molecular interactions. The 2012 meeting further advanced this work, with the mass spectrometry groups finalising approaches to capturing the output from recent developments in the field, such as quantitative proteomics and SRM. The molecular interaction group focused on improving the integration of data from multiple resources. Both groups united with a guest work track, organized by the HUPO Technology/Standards Committee, to formulate proposals for data submissions from the HUPO Human Proteome Project and to start an initiative to collect standard experimental protocols.
This paper focuses on the use of controlled vocabularies (CVs) and ontologies especially in the area of proteomics, primarily related to the work of the Proteomics Standards Initiative (PSI). It describes the relevant proteomics standard formats and the ontologies used within them. Software and tools for working with these ontology files are also discussed. The article also examines the “mapping files” used to ensure correct controlled vocabulary terms that are placed within PSI standards and the fulfillment of the MIAPE (Minimum Information about a Proteomics Experiment) requirements. This article is part of a Special Issue entitled: Computational Proteomics in the Post-Identification Era. Guest Editors: Martin Eisenacher and Christian Stephan.
► The semantic annotation using ontologies is a prerequisite for the semantic web. ► The HUPO-PSI defined a set of XML-based standard formats for proteomics. ► These standard formats allow the referencing of CV terms defined in obo files. ► The CV terms can be used to enforce MIAPE compliance of the data files. ► The mass spectrometry CV is constantly maintained in a community process.
ANDI-MS, Analytical Data Interchange format for Mass Spectrometry; AniML, Analytical Information Markup Language; API, Application Programming Interface; ASCII, American Standard Code for Information Interchange; ASTM, American Society for Testing and Materials; BTO, BRENDA (BRaunschweig ENzyme DAtabase) Tissue Ontology; ChEBI, Chemical Entities of Biological Interest; CV, Controlled Vocabulary; DL, Description Logic; EBI, European Bioinformatics Institute; HDF5, Hierarchical Data Format, version 5; HUPO-PSI, Human Proteome Organisation-Proteomics Standards Initiative; ICD, International Classification of Diseases; IUPAC, International Union for Pure and Applied Chemistry; JCAMP-DX, Joint Committee on Atomic and Molecular Physical data-Data eXchange format; MALDI, Matrix Assisted Laser Desorption Ionization; MeSH, Medical Subject Headings; MI, Molecular Interaction; MIBBI, Minimal Information for Biological and Biomedical Investigations; MITAB, Molecular Interactions TABular format; MIAPE, Minimum Information About a Proteomics Experiment; MS, Mass Spectrometry; NCBI, National Center for Biotechnology Information; NCBO, National Center for Biomedical Ontology; netCDF, Network Common Data Format; OBI, Ontology for Biomedical Investigations; OBO, Open Biological and Biomedical Ontologies; OLS, Ontology Lookup Service; OWL, Web Ontology Language; PAR, Protein Affinity Reagents; PATO, Phenotype Attribute Trait Ontology; PRIDE, PRoteomics IDEntifications database; RDF(S), Resource Description Framework (Schema); SRM, Selected Reaction Monitoring; TPP, Trans-Proteomic Pipeline; URI, Uniform Resource Identifier; XSLT, eXtensible Stylesheet Language Transformation; YAFMS, Yet Another Format for Mass Spectrometry; Proteomics data standards; Controlled vocabularies; Ontologies in proteomics; Ontology formats; Ontology editors and software; Ontology maintenance
Data processing, management and visualization are central and critical components of a state of the art high-throughput mass spectrometry (MS)-based proteomics experiment, and are often some of the most time-consuming steps, especially for labs without much bioinformatics support. The growing interest in the field of proteomics has triggered an increase in the development of new software libraries, including freely available and open-source software. From database search analysis to post-processing of the identification results, even though the objectives of these libraries and packages can vary significantly, they usually share a number of features. Common use cases include the handling of protein and peptide sequences, the parsing of results from various proteomics search engines output files, and the visualization of MS-related information (including mass spectra and chromatograms). In this review, we provide an overview of the existing software libraries, open-source frameworks and also, we give information on some of the freely available applications which make use of them. This article is part of a Special Issue entitled: Computational Proteomics in the Post-Identification Era. Guest Editors: Martin Eisenacher and Christian Stephan.
•A review of existing open-source software for computational proteomics.•Available software for each step in a typical MS experiment is described.•OpenMS, TPP, compomics, ProteoWizard, JPL, PRIDE toolsuite are covered in detail.•Different programming languages are considered (Java, Perl, C++ or Python).
AMT, Accurate Mass Tag; ATAQS, Automated and Targeted Analysis with Quantitative SRM; CV, Controlled Vocabulary; DAO, Data Access Object; EBI, European Bioinformatics Institute; emPAI, exponentially modified Protein Abundance Index; FDR, False Discovery Rate; (HUPO)-PSI, (Human Proteome Organization) — Proteomics Standards Initiative; GUI, Graphical User Interface; ICAT, Isotope-Coded Affinity Tags; ICPL, Isotope-Coded Protein Label; IPTL, Isobaric Peptide Termini Labeling; ISB, Institute for Systems Biology; iTRAQ, Isobaric Tag for Relative and Absolute Quantitation; JPL, Java Proteomic Library; LC-MS, Liquid Chromatography–Mass Spectrometry; LIMS, Laboratory Information Management System; MGF, Mascot Generic Format; MIAPE, Minimum Information About a Proteomics Experiment; MS, Mass Spectrometry; SILAC, Stable Isotope Labeling by Amino acids in Cell culture; PASSEL, PeptideAtlas SRM Experiment Library; PRIDE, PRoteomics IDEntifications (database); PSM, Peptide Spectrum Match; PTM, Post-Translational Modifications; RT, Retention Time; SRM, Selected Reaction Monitoring; TMT, Tandem Mass Tag; TOPP, The OpenMS Proteomics Pipeline; TPP, Trans-Proteomic Pipeline; Proteomics; Databases; Bioinformatics; Software libraries; Application programming interface; Open source software
Reactome (http://www.reactome.org) is a manually curated open-source open-data resource of human pathways and reactions. The current version 46 describes 7088 human proteins (34% of the predicted human proteome), participating in 6744 reactions based on data extracted from 15 107 research publications with PubMed links. The Reactome Web site and analysis tool set have been completely redesigned to increase speed, flexibility and user friendliness. The data model has been extended to support annotation of disease processes due to infectious agents and to mutation.
IntAct (freely available at http://www.ebi.ac.uk/intact) is an open-source, open data molecular interaction database populated by data either curated from the literature or from direct data depositions. IntAct has developed a sophisticated web-based curation tool, capable of supporting both IMEx- and MIMIx-level curation. This tool is now utilized by multiple additional curation teams, all of whom annotate data directly into the IntAct database. Members of the IntAct team supply appropriate levels of training, perform quality control on entries and take responsibility for long-term data maintenance. Recently, the MINT and IntAct databases decided to merge their separate efforts to make optimal use of limited developer resources and maximize the curation output. All data manually curated by the MINT curators have been moved into the IntAct database at EMBL-EBI and are merged with the existing IntAct dataset. Both IntAct and MINT are active contributors to the IMEx consortium (http://www.imexconsortium.org).
The complex biological processes that control cellular function are mediated by intricate networks of molecular interactions. Accumulating evidence indicates that these interactions are often interdependent, thus acting cooperatively. Cooperative interactions are prevalent in and indispensible for reliable and robust control of cell regulation, as they underlie the conditional decision-making capability of large regulatory complexes. Despite an increased focus on experimental elucidation of the molecular details of cooperative binding events, as evidenced by their growing occurrence in literature, they are currently lacking from the main bioinformatics resources. One of the contributing factors to this deficiency is the lack of a computer-readable standard representation and exchange format for cooperative interaction data. To tackle this shortcoming, we added functionality to the widely used PSI-MI interchange format for molecular interaction data by defining new controlled vocabulary terms that allow annotation of different aspects of cooperativity without making structural changes to the underlying XML schema. As a result, we are able to capture cooperative interaction data in a structured format that is backward compatible with PSI-MI–based data and applications. This will facilitate the storage, exchange and analysis of cooperative interaction data, which in turn will advance experimental research on this fundamental principle in biology.
The IMEx consortium is an international collaboration between major public interaction data providers to share curation effort and make a non-redundant set of protein interactions available in a single search interface on a common website (www.imexconsortium.org). Common curation rules have been developed and a central registry is used to manage the selection of articles to enter into the dataset. The advantages of such a service to the user, quality control measures adopted and data distribution practices are discussed.
The Proteomics Standards Initiative has recently released the mzIdentML data standard for representing peptide and protein identification results, for example, created by a search engine. When a new standard format is produced, it is important that software tools are available that make it straightforward for laboratory scientists to use it routinely and for bioinformaticians to embed support in their own tools. Here we report the release of several open-source Java-based software packages based on mzIdentML: ProteoIDViewer, mzidLibrary, and mzidValidator. The ProteoIDViewer is a desktop application allowing users to visualize mzIdentML-formatted results originating from any appropriate identification software; it supports visualization of all the features of the mzIdentML format. The mzidLibrary is a software library containing routines for importing data from external search engines, post-processing identification data (such as false discovery rate calculations), combining results from multiple search engines, performing protein inference, setting identification thresholds, and exporting results from mzIdentML to plain text files. The mzidValidator is able to process files and report warnings or errors if files are not correctly formatted or contain some semantic error. We anticipate that these developments will simplify adoption of the new standard in proteomics laboratories and the integration of mzIdentML into other software tools. All three tools are freely available in the public domain.
Summary: We present iAnn, an open source community-driven platform for dissemination of life science events, such as courses, conferences and workshops. iAnn allows automatic visualisation and integration of customised event reports. A central repository lies at the core of the platform: curators add submitted events, and these are subsequently accessed via web services. Thus, once an iAnn widget is incorporated into a website, it permanently shows timely relevant information as if it were native to the remote site. At the same time, announcements submitted to the repository are automatically disseminated to all portals that query the system. To facilitate the visualization of announcements, iAnn provides powerful filtering options and views, integrated in Google Maps and Google Calendar. All iAnn widgets are freely available.
The Proteomics Standard Initiative Common QUery InterfaCe (PSICQUIC) specification was
created by the Human Proteome Organization Proteomics Standards Initiative (HUPO-PSI) to
enable computational access to molecular-interaction data resources by means of a standard
Web Service and query language. Currently providing >150 million binary interaction
evidences from 28 servers globally, the PSICQUIC interface allows the concurrent search of
multiple molecular-interaction information resources using a single query. Here, we
present an extension of the PSICQUIC specification (version 1.3), which has been released
to be compliant with the enhanced standards in molecular interactions. The new release
also includes a new reference implementation of the PSICQUIC server available to the data
providers. It offers augmented web service capabilities and improves the user experience.
PSICQUIC has been running for almost 5 years, with a user base growing from only 4 data
providers to 28 (April 2013) allowing access to 151 310 109 binary interactions. The power
of this web service is shown in PSICQUIC View web application, an example of how to
simultaneously query, browse and download results from the different PSICQUIC servers.
This application is free and open to all users with no login requirement (http://www.ebi.ac.uk/Tools/webservices/psicquic/view/main.xhtml).
Protein sequence databases are the pillar upon which modern proteomics is supported, representing a stable reference space of predicted and validated proteins. One example of such resources is UniProt, enriched with both expertly curated and automatic annotations. Taken largely for granted, similar mature resources such as UniProt are not available yet in some other “omics” fields, lipidomics being one of them. While having a seasoned community of wet lab scientists, lipidomics lies significantly behind proteomics in the adoption of data standards and other core bioinformatics concepts. This work aims to reduce the gap by developing an equivalent resource to UniProt called ‘LipidHome’, providing theoretically generated lipid molecules and useful metadata. Using the ‘FASTLipid’ Java library, a database was populated with theoretical lipids, generated from a set of community agreed upon chemical bounds. In parallel, a web application was developed to present the information and provide computational access via a web service. Designed specifically to accommodate high throughput mass spectrometry based approaches, lipids are organised into a hierarchy that reflects the variety in the structural resolution of lipid identifications. Additionally, cross-references to other lipid related resources and papers that cite specific lipids were used to annotate lipid records. The web application encompasses a browser for viewing lipid records and a ‘tools’ section where an MS1 search engine is currently implemented. LipidHome can be accessed at http://www.ebi.ac.uk/apweiler-srv/lipidhome.
The MIRIAM Registry (http://www.ebi.ac.uk/miriam/) records information about collections of data in the life sciences, as well as where it can be obtained. This information is used, in combination with the resolving infrastructure of Identifiers.org (http://identifiers.org/), to generate globally unique identifiers, in the form of Uniform Resource Identifier. These identifiers are now widely used to provide perennial cross-references and annotations. The growing demand for these identifiers results in a significant increase in curational efforts to maintain the underlying registry. This requires the design and implementation of an economically viable and sustainable solution able to cope with such expansion. We briefly describe the Registry, the current curation duties entailed, and our plans to extend and distribute this workload through collaborative and community efforts.
Controlled vocabularies (CVs), i.e. a collection of predefined terms describing a modeling domain, used for the semantic annotation of data, and ontologies are used in structured data formats and databases to avoid inconsistencies in annotation, to have a unique (and preferably short) accession number and to give researchers and computer algorithms the possibility for more expressive semantic annotation of data. The Human Proteome Organization (HUPO)–Proteomics Standards Initiative (PSI) makes extensive use of ontologies/CVs in their data formats. The PSI-Mass Spectrometry (MS) CV contains all the terms used in the PSI MS–related data standards. The CV contains a logical hierarchical structure to ensure ease of maintenance and the development of software that makes use of complex semantics. The CV contains terms required for a complete description of an MS analysis pipeline used in proteomics, including sample labeling, digestion enzymes, instrumentation parts and parameters, software used for identification and quantification of peptides/proteins and the parameters and scores used to determine their significance. Owing to the range of topics covered by the CV, collaborative development across several PSI working groups, including proteomics research groups, instrument manufacturers and software vendors, was necessary. In this article, we describe the overall structure of the CV, the process by which it has been developed and is maintained and the dependencies on other ontologies.
Database URL: http://psidev.cvs.sourceforge.net/viewvc/psidev/psi/psi-ms/mzML/controlledVocabulary/psi-ms.obo
Policies supporting the rapid and open sharing of proteomic data are being implemented by the leading journals in the field. The proteomics community is taking steps to ensure that data are made publicly accessible and are of high quality, a challenging task that requires the development and deployment of methods for measuring and documenting data quality metrics. On September 18, 2010, the U.S. National Cancer Institute (NCI) convened the “International Workshop on Proteomic Data Quality Metrics” in Sydney, Australia, to identify and address issues facing the development and use of such methods for open access proteomics data. The stakeholders at the workshop enumerated the key principles underlying a framework for data quality assessment in mass spectrometry data that will meet the needs of the research community, journals, funding agencies, and data repositories. Attendees discussed and agreed up on two primary needs for the wide use of quality metrics: (1) an evolving list of comprehensive quality metrics and (2) standards accompanied by software analytics. Attendees stressed the importance of increased education and training programs to promote reliable protocols in proteomics. This workshop report explores the historic precedents, key discussions, and necessary next steps to enhance the quality of open access data.
By agreement, this article is published simultaneously in the Journal of Proteome Research, Molecular and Cellular Proteomics, Proteomics, and Proteomics Clinical Applications as a public service to the research community. The peer review process was a coordinated effort conducted by a panel of referees selected by the journals.
selected reaction monitoring; bioinformatics; data quality; metrics; open access; Amsterdam Principles; standards
For shotgun mass spectrometry based proteomics the most computationally expensive step is in matching the spectra against an increasingly large database of sequences and their post-translational modifications with known masses. Each mass spectrometer can generate data at an astonishingly high rate, and the scope of what is searched for is continually increasing. Therefore solutions for improving our ability to perform these searches are needed.
We present a sequence database search engine that is specifically designed to run efficiently on the Hadoop MapReduce distributed computing framework. The search engine implements the K-score algorithm, generating comparable output for the same input files as the original implementation. The scalability of the system is shown, and the architecture required for the development of such distributed processing is discussed.
The software is scalable in its ability to handle a large peptide database, numerous modifications and large numbers of spectra. Performance scales with the number of processors in the cluster, allowing throughput to expand with the available resources.
The PRoteomics IDEntifications (PRIDE, http://www.ebi.ac.uk/pride) database at the European Bioinformatics Institute is one of the most prominent data repositories of mass spectrometry (MS)-based proteomics data. Here, we summarize recent developments in the PRIDE database and related tools. First, we provide up-to-date statistics in data content, splitting the figures by groups of organisms and species, including peptide and protein identifications, and post-translational modifications. We then describe the tools that are part of the PRIDE submission pipeline, especially the recently developed PRIDE Converter 2 (new submission tool) and PRIDE Inspector (visualization and analysis tool). We also give an update about the integration of PRIDE with other MS proteomics resources in the context of the ProteomeXchange consortium. Finally, we briefly review the quality control efforts that are ongoing at present and outline our future plans.
The availability of comprehensive information about enzymes plays an important role in answering questions relevant to interdisciplinary fields such as biochemistry, enzymology, biofuels, bioengineering and drug discovery. At the EMBL European Bioinformatics Institute, we have developed an enzyme portal (http://www.ebi.ac.uk/enzymeportal) to provide this wealth of information on enzymes from multiple in-house resources addressing particular data classes: protein sequence and structure, reactions, pathways and small molecules. The fact that these data reside in separate databases makes information discovery cumbersome. The main goal of the portal is to simplify this process for end users.
Reactome describes biological pathways as chemical reactions that closely mirror the actual physical interactions that occur in the cell. Recent extensions of our data model accommodate the annotation of cancer and other disease processes. First, we have extended our class of protein modifications to accommodate annotation of changes in amino acid sequence and the formation of fusion proteins to describe the proteins involved in disease processes. Second, we have added a disease attribute to reaction, pathway, and physical entity classes that uses disease ontology terms. To support the graphical representation of “cancer” pathways, we have adapted our Pathway Browser to display disease variants and events in a way that allows comparison with the wild type pathway, and shows connections between perturbations in cancer and other biological pathways. The curation of pathways associated with cancer, coupled with our efforts to create other disease-specific pathways, will interoperate with our existing pathway and network analysis tools. Using the Epidermal Growth Factor Receptor (EGFR) signaling pathway as an example, we show how Reactome annotates and presents the altered biological behavior of EGFR variants due to their altered kinase and ligand-binding properties, and the mode of action and specificity of anti-cancer therapeutics.
pathway database; pathway visualization; network visualization; cancer annotation; EGFR signaling
A large number of diverse, complex, and distributed data resources are currently available in the Bioinformatics domain. The pace of discovery and the diversity of information means that centralised reference databases like UniProt and Ensembl cannot integrate all potentially relevant information sources. From a user perspective however, centralised access to all relevant information concerning a specific query is essential. The Distributed Annotation System (DAS) defines a communication protocol to exchange annotations on genomic and protein sequences; this standardisation enables clients to retrieve data from a myriad of sources, thus offering centralised access to end-users.
We introduce MyDas, a web server that facilitates the publishing of biological annotations according to the DAS specification. It deals with the common functionality requirements of making data available, while also providing an extension mechanism in order to implement the specifics of data store interaction. MyDas allows the user to define where the required information is located along with its structure, and is then responsible for the communication protocol details.
The original PRIDE Converter tool greatly simplified the process of submitting mass spectrometry (MS)-based proteomics data to the PRIDE database. However, after much user feedback, it was noted that the tool had some limitations and could not handle several user requirements that were now becoming commonplace. This prompted us to design and implement a whole new suite of tools that would build on the successes of the original PRIDE Converter and allow users to generate submission-ready, well-annotated PRIDE XML files. The PRIDE Converter 2 tool suite allows users to convert search result files into PRIDE XML (the format needed for performing submissions to the PRIDE database), generate mzTab skeleton files that can be used as a basis to submit quantitative and gel-based MS data, and post-process PRIDE XML files by filtering out contaminants and empty spectra, or by merging several PRIDE XML files together. All the tools have both a graphical user interface that provides a dialog-based, user-friendly way to convert and prepare files for submission, as well as a command-line interface that can be used to integrate the tools into existing or novel pipelines, for batch processing and power users. The PRIDE Converter 2 tool suite will thus become a cornerstone in the submission process to PRIDE and, by extension, to the ProteomeXchange consortium of MS-proteomics data repositories.
New methods for performing quantitative proteome analyses based on differential labeling protocols or label-free techniques are reported in the literature on an almost monthly basis. In parallel, a correspondingly vast number of software tools for the analysis of quantitative proteomics data has also been described in the literature and produced by private companies. In this article we focus on the review of some of the most popular techniques in the field and present a critical appraisal of several software packages available to process and analyze the data produced. We also describe the importance of community standards to support the wide range of software, which may assist researchers in the analysis of data using different platforms and protocols. It is intended that this review will serve bench scientists both as a useful reference and a guide to the selection and use of different pipelines to perform quantitative proteomics data analysis. We have produced a web-based tool (http://www.proteosuite.org/?q=other_resources) to help researchers find appropriate software for their local instrumentation, available file formats, and quantitative methodology.
To make full use of research data, the bioscience community needs to adopt technologies and reward mechanisms that support interoperability and promote the growth of an open ‘data commoning’ culture. Here we describe the prerequisites for data commoning and present an established and growing ecosystem of solutions using the shared ‘Investigation-Study-Assay’ framework to support that vision.
The Protein Identifier Cross-Reference (PICR) service is a tool that allows users to map protein identifiers, protein sequences and gene identifiers across over 100 different source databases. PICR takes input through an interactive website as well as Representational State Transfer (REST) and Simple Object Access Protocol (SOAP) services. It returns the results as HTML pages, XLS and CSV files. It has been in production since 2007 and has been recently enhanced to add new functionality and increase the number of databases it covers. Protein subsequences can be Basic Local Alignment Search Tool (BLAST) against the UniProt Knowledgebase (UniProtKB) to provide an entry point to the standard PICR mapping algorithm. In addition, gene identifiers from UniProtKB and Ensembl can now be submitted as input or mapped to as output from PICR. We have also implemented a ‘best-guess’ mapping algorithm for UniProt. In this article, we describe the usefulness of PICR, how these changes have been implemented, and the corresponding additions to the web services. Finally, we explain that the number of source databases covered by PICR has increased from the initial 73 to the current 102. New resources include several new species-specific Ensembl databases as well as the Ensembl Genome ones. PICR can be accessed at http://www.ebi.ac.uk/Tools/picr/.