Mass spectrometry (MS) is the main technology used in proteomics approaches. However, on average 75% of spectra analysed in an MS experiment remain unidentified. We propose to use spectrum clustering at a large-scale to shed a light on these unidentified spectra. PRoteomics IDEntifications database (PRIDE) Archive is one of the largest MS proteomics public data repositories worldwide. By clustering all tandem MS spectra publicly available in PRIDE Archive, coming from hundreds of datasets, we were able to consistently characterize three distinct groups of spectra: 1) incorrectly identified spectra, 2) spectra correctly identified but below the set scoring threshold, and 3) truly unidentified spectra. Using a multitude of complementary analysis approaches, we were able to identify less than 20% of the consistently unidentified spectra. The complete spectrum clustering results are available through the new version of the PRIDE Cluster resource (http://www.ebi.ac.uk/pride/cluster). This resource is intended, among other aims, to encourage and simplify further investigation into these unidentified spectra.
Every data-rich community research effort requires a clear plan for ensuring the quality of the data interpretation and comparability of analyses. To address this need within the Human Proteome Project (HPP) of the Human Proteome Organization (HUPO), we have developed through broad consultation a set of mass spectrometry data interpretation guidelines that should be applied to all HPP data contributions. For submission of manuscripts reporting HPP protein identification results, the guidelines are presented as a one-page checklist containing fifteen essential points followed by two pages of expanded description of each. Here, we present an overview of the guidelines and provide an in-depth description of each of the fifteen elements to facilitate understanding of the intentions and rationale behind the guidelines, both for authors and for reviewers. Broadly, these guidelines provide specific directions regarding how HPP data are to be submitted to mass spectrometry data repositories, how error analysis should be presented, and how detection of novel proteins should be supported with additional confirmatory evidence. These guidelines, developed by the HPP community, are presented to the broader scientific community for further discussion.
Guidelines; standards; Human Proteome Project; mass spectrometry; false-discovery rates; alternative protein matches
A large gap remains between the amount of knowledge in scientific literature and the fraction that gets curated into standardized databases, despite many curation initiatives. Yet the availability of comprehensive knowledge in databases is crucial for exploiting existing background knowledge, both for designing follow-up experiments and for interpreting new experimental data. Structured resources also underpin the computational integration and modeling of regulatory pathways, which further aids our understanding of regulatory dynamics. We argue how cooperation between the scientific community and professional curators can increase the capacity of capturing precise knowledge from literature. We demonstrate this with a project in which we mobilize biological domain experts who curate large amounts of DNA binding transcription factors, and show that they, although new to the field of curation, can make valuable contributions by harvesting reported knowledge from scientific papers. Such community curation can enhance the scientific epistemic process.
Database URL: http://www.tfcheckpoint.org
Objective To describe the goals of the Proteomics Standards Initiative (PSI) of the Human Proteome Organization, the methods that the PSI has employed to create data standards, the resulting output of the PSI, lessons learned from the PSI’s evolution, and future directions and synergies for the group.
Materials and Methods The PSI has 5 categories of deliverables that have guided the group. These are minimum information guidelines, data formats, controlled vocabularies, resources and software tools, and dissemination activities. These deliverables are produced via the leadership and working group organization of the initiative, driven by frequent workshops and ongoing communication within the working groups. Official standards are subjected to a rigorous document process that includes several levels of peer review prior to release.
Results We have produced and published minimum information guidelines describing what information should be provided when making data public, either via public repositories or other means. The PSI has produced a series of standard formats covering mass spectrometer input, mass spectrometer output, results of informatics analysis (both qualitative and quantitative analyses), reports of molecular interaction data, and gel electrophoresis analyses. We have produced controlled vocabularies that ensure that concepts are uniformly annotated in the formats and engaged in extensive software development and dissemination efforts so that the standards can efficiently be used by the community.
Conclusion In its first dozen years of operation, the PSI has produced many standards that have accelerated the field of proteomics by facilitating data exchange and deposition to data repositories. We look to the future to continue developing standards for new proteomics technologies and workflows and mechanisms for integration with other omics data types. Our products facilitate the translation of genomics and proteomics findings to clinical and biological phenotypes. The PSI website can be accessed at http://www.psidev.info.
standards; data standards; data formats; guidelines; proteomics; standards organization; HUPO; proteomics standards initiative
In this article we discuss the requirements to use data mining of published proteomics datasets to assist proteomics‐based biomarker discovery, the use of external data integration to solve the issue of inadequate small sample sizes and finally, we try to estimate the probability that new biomarkers will be identified through data mining alone.
Bioinformatics; Biomarker; Databases; Data mining; Mass spectrometry
The exponential increase in Big Data generation combined with limited capitalization on the wealth of information embedded within Big Data have prompted us to revisit our scientific discovery paradigms. A successful transition into this digital era of medicine holds great promise for advancing fundamental knowledge in biology, innovating human health and driving personalized medicine, however, this will require a drastic shift of research culture in how we conceptualize science and use data. An e-transformation will require global adoption and synergism among computational science, biomedical research and clinical domains.
data science; Omics; users; computational tools; crowdsourcing
The Reactome database of curated biological pathways provides a tool for visualizing user-supplied expression data as an overlay on pathway diagrams, thereby providing an effective means to examine expression of the constituents of the pathway and determine whether all that are necessary are present. Several experiments can be visualized in succession, to determine whether expression changes with experimental conditions, a useful feature for examining a time-course, dose-response or disease progression.
Reactome; Pathway; Expression Analysis; Microarray; quantitative proteomics
A recent community survey conducted by Infrastructure for Systems Biology Europe (ISBE) informs requirements for developing an efficient infrastructure for systems biology standards, data and model management.
Methods & Resources
The Reactome Knowledgebase (www.reactome.org) provides molecular details of signal transduction, transport, DNA replication, metabolism and other cellular processes as an ordered network of molecular transformations—an extended version of a classic metabolic map, in a single consistent data model. Reactome functions both as an archive of biological processes and as a tool for discovering unexpected functional relationships in data such as gene expression pattern surveys or somatic mutation catalogues from tumour cells. Over the last two years we redeveloped major components of the Reactome web interface to improve usability, responsiveness and data visualization. A new pathway diagram viewer provides a faster, clearer interface and smooth zooming from the entire reaction network to the details of individual reactions. Tool performance for analysis of user datasets has been substantially improved, now generating detailed results for genome-wide expression datasets within seconds. The analysis module can now be accessed through a RESTFul interface, facilitating its inclusion in third party applications. A new overview module allows the visualization of analysis results on a genome-wide Reactome pathway hierarchy using a single screen page. The search interface now provides auto-completion as well as a faceted search to narrow result lists efficiently.
The original PRIDE Inspector tool was developed as an open source standalone tool to enable the visualization and validation of mass-spectrometry (MS)-based proteomics data before data submission or already publicly available in the Proteomics Identifications (PRIDE) database. The initial implementation of the tool focused on visualizing PRIDE data by supporting the PRIDE XML format and a direct access to private (password protected) and public experiments in PRIDE.
The ProteomeXchange (PX) Consortium has been set up to enable a better integration of existing public proteomics repositories, maximizing its benefit to the scientific community through the implementation of standard submission and dissemination pipelines. Within the Consortium, PRIDE is focused on supporting submissions of tandem MS data. The increasing use and popularity of the new Proteomics Standards Initiative (PSI) data standards such as mzIdentML and mzTab, and the diversity of workflows supported by the PX resources, prompted us to design and implement a new suite of algorithms and libraries that would build upon the success of the original PRIDE Inspector and would enable users to visualize and validate PX “complete” submissions. The PRIDE Inspector Toolsuite supports the handling and visualization of different experimental output files, ranging from spectra (mzML, mzXML, and the most popular peak lists formats) and peptide and protein identification results (mzIdentML, PRIDE XML, mzTab) to quantification data (mzTab, PRIDE XML), using a modular and extensible set of open-source, cross-platform libraries. We believe that the PRIDE Inspector Toolsuite represents a milestone in the visualization and quality assessment of proteomics data. It is freely available at http://github.com/PRIDE-Toolsuite/.
The PRoteomics IDEntifications (PRIDE) database is one of the world-leading data repositories of mass spectrometry (MS)-based proteomics data. Since the beginning of 2014, PRIDE Archive (http://www.ebi.ac.uk/pride/archive/) is the new PRIDE archival system, replacing the original PRIDE database. Here we summarize the developments in PRIDE resources and related tools since the previous update manuscript in the Database Issue in 2013. PRIDE Archive constitutes a complete redevelopment of the original PRIDE, comprising a new storage backend, data submission system and web interface, among other components. PRIDE Archive supports the most-widely used PSI (Proteomics Standards Initiative) data standard formats (mzML and mzIdentML) and implements the data requirements and guidelines of the ProteomeXchange Consortium. The wide adoption of ProteomeXchange within the community has triggered an unprecedented increase in the number of submitted data sets (around 150 data sets per month). We outline some statistics on the current PRIDE Archive data contents. We also report on the status of the PRIDE related stand-alone tools: PRIDE Inspector, PRIDE Converter 2 and the ProteomeXchange submission tool. Finally, we will give a brief update on the resources under development ‘PRIDE Cluster’ and ‘PRIDE Proteomes’, which provide a complementary view and quality-scored information of the peptide and protein identification data available in PRIDE Archive.
Reactome (http://www.reactome.org) is an open source, expert-authored, peer-reviewed, manually curated database of reactions, pathways and biological processes. We provide an intuitive web-based user interface to pathway knowledge and a suite of data analysis tools. The Pathway Browser is a Systems Biology Graphical Notation (SBGN)-like visualization system that supports manual navigation of pathways by zooming, scrolling and event highlighting, and that exploits PSI Common Query Interface (PSIQUIC) web services to overlay pathways with molecular interaction data from the Reactome Functional Interaction (FI) Network and interaction databases such as IntAct, ChEMBL, and BioGRID. Pathway and Expression Analysis tools employ web services to provide ID mapping, pathway assignment and over-representation analysis of user-supplied datasets. By applying Ensembl Compara to curated human proteins and reactions, Reactome generates pathway inferences for 20 other species. The Species Comparison tool provides a summary of results for each of these species as a table showing numbers of orthologous proteins found by pathway from which users can navigate to inferred details for specific proteins and reactions. Reactome’s diverse pathway knowledge and suite of data analysis tools provide a platform for data mining, modeling and the analysis of large-scale proteomics datasets.
Pathway database; Pathway visualization; Pathway analysis; BioMart; Data integration
Molecular interaction databases are essential resources that enable access to a wealth of information on associations between proteins and other biomolecules. Network graphs generated from these data provide an understanding of the relationships between different proteins in the cell, and network analysis has become a widespread tool supporting –omics analysis. Meaningfully representing this information remains far from trivial and different databases strive to provide users with detailed records capturing the experimental details behind each piece of interaction evidence. A targeted curation approach is necessary to transfer published data generated by primarily low-throughput techniques into interaction databases. In this review we present an example highlighting the value of both targeted curation and the subsequent effective visualization of detailed features of manually curated interaction information. We have curated interactions involving LRRK2, a protein of largely unknown function linked to familial forms of Parkinson's disease, and hosted the data in the IntAct database. This LRRK2-specific dataset was then used to produce different visualization examples highlighting different aspects of the data: the level of confidence in the interaction based on orthogonal evidence, those interactions found under close-to-native conditions, and the enzyme–substrate relationships in different in vitro enzymatic assays. Finally, pathway annotation taken from the Reactome database was overlaid on top of interaction networks to bring biological functional context to interaction maps.
Bioinformatics; Curation; Data visualization; Molecular interaction database; Parkinson's disease; Protein interaction network
Compared to other data-intensive disciplines such as genomics, public deposition and storage of MS-based proteomics, data are still less developed due to, among other reasons, the inherent complexity of the data and the variety of data types and experimental workflows. In order to address this need, several public repositories for MS proteomics experiments have been developed, each with different purposes in mind. The most established resources are the Global Proteome Machine Database (GPMDB), PeptideAtlas, and the PRIDE database. Additionally, there are other useful (in many cases recently developed) resources such as ProteomicsDB, Mass Spectrometry Interactive Virtual Environment (MassIVE), Chorus, MaxQB, PeptideAtlas SRM Experiment Library (PASSEL), Model Organism Protein Expression Database (MOPED), and the Human Proteinpedia. In addition, the ProteomeXchange consortium has been recently developed to enable better integration of public repositories and the coordinated sharing of proteomics information, maximizing its benefit to the scientific community. Here, we will review each of the major proteomics resources independently and some tools that enable the integration, mining and reuse of the data. We will also discuss some of the major challenges and current pitfalls in the integration and sharing of the data.
Bioinformatics; Databases; MS; Repositories
Bioinformatics; Chart; Hierarchy; Network; Software library
Summary: The ms-data-core-api is a free, open-source library for developing computational proteomics tools and pipelines. The Application Programming Interface, written in Java, enables rapid tool creation by providing a robust, pluggable programming interface and common data model. The data model is based on controlled vocabularies/ontologies and captures the whole range of data types included in common proteomics experimental workflows, going from spectra to peptide/protein identifications to quantitative results. The library contains readers for three of the most used Proteomics Standards Initiative standard file formats: mzML, mzIdentML, and mzTab. In addition to mzML, it also supports other common mass spectra data formats: dta, ms2, mgf, pkl, apl (text-based), mzXML and mzData (XML-based). Also, it can be used to read PRIDE XML, the original format used by the PRIDE database, one of the world-leading proteomics resources. Finally, we present a set of algorithms and tools whose implementation illustrates the simplicity of developing applications using the library.
Availability and implementation: The software is freely available at https://github.com/PRIDE-Utilities/ms-data-core-api.
Supplementary data are available at Bioinformatics online
The PRIDE (PRoteomics IDEntifications) database is one of the world-leading public repositories of mass spectrometry (MS)-based proteomics data and it is a founding member of the ProteomeXchange Consortium of proteomics resources. In the original PRIDE database system, users could access data programmatically by accessing the web services provided by the PRIDE BioMart interface. New REST (REpresentational State Transfer) web services have been developed to serve the most popular functionality provided by BioMart (now discontinued due to data scalability issues) and address the data access requirements of the newly developed PRIDE Archive. Using the API (Application Programming Interface) it is now possible to programmatically query for and retrieve peptide and protein identifications, project and assay metadata and the originally submitted files. Searching and filtering is also possible by metadata information, such as sample details (e.g. species and tissues), instrumentation (mass spectrometer), keywords and other provided annotations. The PRIDE Archive web services were first made available in April 2014. The API has already been adopted by a few applications and standalone tools such as PeptideShaker, PRIDE Inspector, the Unipept web application and the Python-based BioServices package. This application is free and open to all users with no login requirement and can be accessed at http://www.ebi.ac.uk/pride/ws/archive/.
Data standardization; Human Proteome Organisation; Proteomics Standards Initiative
The manual curation of the information in biomedical resources is an expensive task. This article argues the value of this approach in comparison with other apparently less costly options, such as automated annotation or text-mining, then discusses ways in which databases can make cost savings by sharing infrastructure and tool development. Sharing curation effort is a model already being adopted by several data resources. Approaches taken by two of these, the Gene Ontology annotation effort and the IntAct molecular interaction database, are reviewed in more detail. These models help to ensure long-term persistence of curated data and minimizes redundant development of resources by multiple disparate groups.
http://www.ebi.ac.uk/intact and http://www.ebi.ac.uk/GOA/
The evidence that two molecules interact in a living cell is often inferred from multiple different experiments. Experimental data is captured in multiple repositories, but there is no simple way to assess the evidence of an interaction occurring in a cellular environment. Merging and scoring of data are commonly required operations after querying for the details of specific molecular interactions, to remove redundancy and assess the strength of accompanying experimental evidence. We have developed both a merging algorithm and a scoring system for molecular interactions based on the proteomics standard initiative–molecular interaction standards. In this manuscript, we introduce these two algorithms and provide community access to the tool suite, describe examples of how these tools are useful to selectively present molecular interaction data and demonstrate a case where the algorithms were successfully used to identify a systematic error in an existing dataset.
Motivation: On the semantic web, in life sciences in particular, data is often distributed via multiple resources. Each of these sources is likely to use their own International Resource Identifier for conceptually the same resource or database record. The lack of correspondence between identifiers introduces a barrier when executing federated SPARQL queries across life science data.
Results: We introduce a novel SPARQL-based service to enable on-the-fly integration of life science data. This service uses the identifier patterns defined in the Identifiers.org Registry to generate a plurality of identifier variants, which can then be used to match source identifiers with target identifiers. We demonstrate the utility of this identifier integration approach by answering queries across major producers of life science Linked Data.
Availability and implementation: The SPARQL-based identifier conversion service is available without restriction at http://identifiers.org/services/sparql.
BioModels (http://www.ebi.ac.uk/biomodels/) is a repository of mathematical models of biological processes. A large set of models is curated to verify both correspondence to the biological process that the model seeks to represent, and reproducibility of the simulation results as described in the corresponding peer-reviewed publication. Many models submitted to the database are annotated, cross-referencing its components to external resources such as database records, and terms from controlled vocabularies and ontologies. BioModels comprises two main branches: one is composed of models derived from literature, while the second is generated through automated processes. BioModels currently hosts over 1200 models derived directly from the literature, as well as in excess of 140 000 models automatically generated from pathway resources. This represents an approximate 60-fold growth for literature-based model numbers alone, since BioModels’ first release a decade ago. This article describes updates to the resource over this period, which include changes to the user interface, the annotation profiles of models in the curation pipeline, major infrastructure changes, ability to perform online simulations and the availability of model content in Linked Data form. We also outline planned improvements to cope with a diverse array of new challenges.
mzTab is the most recent standard format developed by the Proteomics Standards Initiative. mzTab is a flexible tab-delimited file that can capture identification and quantification results coming from MS-based proteomics and metabolomics approaches. We here present an open-source Java application programming interface for mzTab called jmzTab. The software allows the efficient processing of mzTab files, providing read and write capabilities, and is designed to be embedded in other software packages. The second key feature of the jmzTab model is that it provides a flexible framework to maintain the logical integrity between the metadata and the table-based sections in the mzTab files. In this article, as two example implementations, we also describe two stand-alone tools that can be used to validate mzTab files and to convert PRIDE XML files to mzTab. The library is freely available at http://mztab.googlecode.com.
Bioinformatics; Data standard; Java application programming interface; Proteomics Standards Initiative