Controlled vocabularies (CVs), i.e. a collection of predefined terms describing a modeling domain, used for the semantic annotation of data, and ontologies are used in structured data formats and databases to avoid inconsistencies in annotation, to have a unique (and preferably short) accession number and to give researchers and computer algorithms the possibility for more expressive semantic annotation of data. The Human Proteome Organization (HUPO)–Proteomics Standards Initiative (PSI) makes extensive use of ontologies/CVs in their data formats. The PSI-Mass Spectrometry (MS) CV contains all the terms used in the PSI MS–related data standards. The CV contains a logical hierarchical structure to ensure ease of maintenance and the development of software that makes use of complex semantics. The CV contains terms required for a complete description of an MS analysis pipeline used in proteomics, including sample labeling, digestion enzymes, instrumentation parts and parameters, software used for identification and quantification of peptides/proteins and the parameters and scores used to determine their significance. Owing to the range of topics covered by the CV, collaborative development across several PSI working groups, including proteomics research groups, instrument manufacturers and software vendors, was necessary. In this article, we describe the overall structure of the CV, the process by which it has been developed and is maintained and the dependencies on other ontologies.
Database URL: http://psidev.cvs.sourceforge.net/viewvc/psidev/psi/psi-ms/mzML/controlledVocabulary/psi-ms.obo
Policies supporting the rapid and open sharing of proteomic data are being implemented by the leading journals in the field. The proteomics community is taking steps to ensure that data are made publicly accessible and are of high quality, a challenging task that requires the development and deployment of methods for measuring and documenting data quality metrics. On September 18, 2010, the U.S. National Cancer Institute (NCI) convened the “International Workshop on Proteomic Data Quality Metrics” in Sydney, Australia, to identify and address issues facing the development and use of such methods for open access proteomics data. The stakeholders at the workshop enumerated the key principles underlying a framework for data quality assessment in mass spectrometry data that will meet the needs of the research community, journals, funding agencies, and data repositories. Attendees discussed and agreed up on two primary needs for the wide use of quality metrics: (1) an evolving list of comprehensive quality metrics and (2) standards accompanied by software analytics. Attendees stressed the importance of increased education and training programs to promote reliable protocols in proteomics. This workshop report explores the historic precedents, key discussions, and necessary next steps to enhance the quality of open access data.
By agreement, this article is published simultaneously in the Journal of Proteome Research, Molecular and Cellular Proteomics, Proteomics, and Proteomics Clinical Applications as a public service to the research community. The peer review process was a coordinated effort conducted by a panel of referees selected by the journals.
selected reaction monitoring; bioinformatics; data quality; metrics; open access; Amsterdam Principles; standards
For shotgun mass spectrometry based proteomics the most computationally expensive step is in matching the spectra against an increasingly large database of sequences and their post-translational modifications with known masses. Each mass spectrometer can generate data at an astonishingly high rate, and the scope of what is searched for is continually increasing. Therefore solutions for improving our ability to perform these searches are needed.
We present a sequence database search engine that is specifically designed to run efficiently on the Hadoop MapReduce distributed computing framework. The search engine implements the K-score algorithm, generating comparable output for the same input files as the original implementation. The scalability of the system is shown, and the architecture required for the development of such distributed processing is discussed.
The software is scalable in its ability to handle a large peptide database, numerous modifications and large numbers of spectra. Performance scales with the number of processors in the cluster, allowing throughput to expand with the available resources.
The PRoteomics IDEntifications (PRIDE, http://www.ebi.ac.uk/pride) database at the European Bioinformatics Institute is one of the most prominent data repositories of mass spectrometry (MS)-based proteomics data. Here, we summarize recent developments in the PRIDE database and related tools. First, we provide up-to-date statistics in data content, splitting the figures by groups of organisms and species, including peptide and protein identifications, and post-translational modifications. We then describe the tools that are part of the PRIDE submission pipeline, especially the recently developed PRIDE Converter 2 (new submission tool) and PRIDE Inspector (visualization and analysis tool). We also give an update about the integration of PRIDE with other MS proteomics resources in the context of the ProteomeXchange consortium. Finally, we briefly review the quality control efforts that are ongoing at present and outline our future plans.
The availability of comprehensive information about enzymes plays an important role in answering questions relevant to interdisciplinary fields such as biochemistry, enzymology, biofuels, bioengineering and drug discovery. At the EMBL European Bioinformatics Institute, we have developed an enzyme portal (http://www.ebi.ac.uk/enzymeportal) to provide this wealth of information on enzymes from multiple in-house resources addressing particular data classes: protein sequence and structure, reactions, pathways and small molecules. The fact that these data reside in separate databases makes information discovery cumbersome. The main goal of the portal is to simplify this process for end users.
A large number of diverse, complex, and distributed data resources are currently available in the Bioinformatics domain. The pace of discovery and the diversity of information means that centralised reference databases like UniProt and Ensembl cannot integrate all potentially relevant information sources. From a user perspective however, centralised access to all relevant information concerning a specific query is essential. The Distributed Annotation System (DAS) defines a communication protocol to exchange annotations on genomic and protein sequences; this standardisation enables clients to retrieve data from a myriad of sources, thus offering centralised access to end-users.
We introduce MyDas, a web server that facilitates the publishing of biological annotations according to the DAS specification. It deals with the common functionality requirements of making data available, while also providing an extension mechanism in order to implement the specifics of data store interaction. MyDas allows the user to define where the required information is located along with its structure, and is then responsible for the communication protocol details.
The original PRIDE Converter tool greatly simplified the process of submitting mass spectrometry (MS)-based proteomics data to the PRIDE database. However, after much user feedback, it was noted that the tool had some limitations and could not handle several user requirements that were now becoming commonplace. This prompted us to design and implement a whole new suite of tools that would build on the successes of the original PRIDE Converter and allow users to generate submission-ready, well-annotated PRIDE XML files. The PRIDE Converter 2 tool suite allows users to convert search result files into PRIDE XML (the format needed for performing submissions to the PRIDE database), generate mzTab skeleton files that can be used as a basis to submit quantitative and gel-based MS data, and post-process PRIDE XML files by filtering out contaminants and empty spectra, or by merging several PRIDE XML files together. All the tools have both a graphical user interface that provides a dialog-based, user-friendly way to convert and prepare files for submission, as well as a command-line interface that can be used to integrate the tools into existing or novel pipelines, for batch processing and power users. The PRIDE Converter 2 tool suite will thus become a cornerstone in the submission process to PRIDE and, by extension, to the ProteomeXchange consortium of MS-proteomics data repositories.
To make full use of research data, the bioscience community needs to adopt technologies and reward mechanisms that support interoperability and promote the growth of an open ‘data commoning’ culture. Here we describe the prerequisites for data commoning and present an established and growing ecosystem of solutions using the shared ‘Investigation-Study-Assay’ framework to support that vision.
The Protein Identifier Cross-Reference (PICR) service is a tool that allows users to map protein identifiers, protein sequences and gene identifiers across over 100 different source databases. PICR takes input through an interactive website as well as Representational State Transfer (REST) and Simple Object Access Protocol (SOAP) services. It returns the results as HTML pages, XLS and CSV files. It has been in production since 2007 and has been recently enhanced to add new functionality and increase the number of databases it covers. Protein subsequences can be Basic Local Alignment Search Tool (BLAST) against the UniProt Knowledgebase (UniProtKB) to provide an entry point to the standard PICR mapping algorithm. In addition, gene identifiers from UniProtKB and Ensembl can now be submitted as input or mapped to as output from PICR. We have also implemented a ‘best-guess’ mapping algorithm for UniProt. In this article, we describe the usefulness of PICR, how these changes have been implemented, and the corresponding additions to the web services. Finally, we explain that the number of source databases covered by PICR has increased from the initial 73 to the current 102. New resources include several new species-specific Ensembl databases as well as the Ensembl Genome ones. PICR can be accessed at http://www.ebi.ac.uk/Tools/picr/.
We report the release of mzIdentML, an exchange standard for peptide and protein identification data, designed by the Proteomics Standards Initiative. The format was developed by the Proteomics Standards Initiative in collaboration with instrument and software vendors, and the developers of the major open-source projects in proteomics. Software implementations have been developed to enable conversion from most popular proprietary and open-source formats, and mzIdentML will soon be supported by the major public repositories. These developments enable proteomics scientists to start working with the standard for exchanging and publishing data sets in support of publications and they provide a stable platform for bioinformatics groups and commercial software vendors to work with a single file format for identification data.
The PRoteomics IDEntifications (PRIDE) database is a large public proteomics data repository, containing over 270 million mass spectra (by November 2011). PRIDE is an archival database, providing the proteomics data supporting specific scientific publications in a computationally accessible manner. While PRIDE faces rapid increases in data deposition size as well as number of depositions, the major challenge is to ensure a high quality of data depositions in the context of highly diverse proteomics work flows and data representations. Here, we describe the PRIDE curation pipeline and its practical application in quality control of complex data depositions.
IntAct is an open-source, open data molecular interaction database populated by data either curated from the literature or from direct data depositions. Two levels of curation are now available within the database, with both IMEx-level annotation and less detailed MIMIx-compatible entries currently supported. As from September 2011, IntAct contains approximately 275 000 curated binary interaction evidences from over 5000 publications. The IntAct website has been improved to enhance the search process and in particular the graphical display of the results. New data download formats are also available, which will facilitate the inclusion of IntAct's data in the Semantic Web. IntAct is an active contributor to the IMEx consortium (http://www.imexconsortium.org). IntAct source code and data are freely available at http://www.ebi.ac.uk/intact.
The reversible phosphorylation of serine, threonine and tyrosine hydroxyl groups is an especially prominent form of post-translational modification (PTM) of proteins. It plays critical roles in the regulation of diverse processes, and mutations that directly or indirectly affect these phosphorylation events have been associated with many cancers and other pathologies. Here, we describe the development of a new BioMart tool that gathers data from three different biological resources to provide the user with an integrated view of phosphorylation events associated with a human protein of interest, the complexes of which the protein (modified or not) is a part, the reactions in which the protein and its complexes participate and the somatic mutations that might be expected to perturb those functions. The three resources used are the Reactome, PRIDE and COSMIC databases. The Reactome knowledgebase contains annotations of phosphorylated human proteins linked to the reactions in which they are phosphorylated and dephosphorylated, to the complexes of which they are parts and to the reactions in which the phosphorylated proteins participate as substrates, catalysts and regulators. The PRIDE database holds extensive mass spectrometry data from which protein phosphorylation patterns can be inferred, and the COSMIC database holds records of somatic mutations found in human cancer cells. This tool supports both flexible, user-specified queries and standard (‘canned’) queries to retrieve frequently used combinations of data for user-specified proteins and reactions. We demonstrate using the Wnt signaling pathway and the human c-SRC protein how the tool can be used to place somatic mutation data into a functional perspective by changing critical residues involved in pathway modulation, and where available, check for mass spectrometry evidence in PRIDE supporting identification of the critical residue.
Database URL: http://www.reactome.org/cgi-bin/mart
Reactome is an open source, expert-authored, manually curated and peer-reviewed database of reactions, pathways and biological processes. We provide an intuitive web-based user interface to pathway knowledge and a suite of data analysis tools. The Reactome BioMart provides biologists and bioinformaticians with a single web interface for performing simple or elaborate queries of the Reactome database, aggregating data from different sources and providing an opportunity to integrate experimental and computational results with information relating to biological pathways.
Database URL: http://www.reactome.org
The Human Proteome Organisation’s Proteomics Standards Initiative (HUPO-PSI) has developed the GelML data exchange format for representing gel electrophoresis experiments performed in proteomics investigations. The format closely follows the reporting guidelines for gel electrophoresis, which are part of the Minimum Information About a Proteomics Experiment (MIAPE) set of modules. GelML supports the capture of metadata (such as experimental protocols) and data (such as gel images) resulting from gel electrophoresis so that laboratories can be compliant with the MIAPE Gel Electrophoresis guidelines, while allowing such data sets to be exchanged or downloaded from public repositories. The format is sufficiently flexible to capture data from a broad range of experimental processes, and complements other PSI formats for mass spectrometry data and the results of protein and peptide identifications to capture entire gel-based proteome workflows. GelML has resulted from the open standardisation process of PSI consisting of both public consultation and anonymous review of the specifications.
data standard; gel electrophoresis; database; ontology
Motivation: Dasty3 is a highly interactive and extensible Web-based framework. It provides a rich Application Programming Interface upon which it is possible to develop specialized clients capable of retrieving information from DAS sources as well as from data providers not using the DAS protocol. Dasty3 provides significant improvements on previous Web-based frameworks and is implemented using the 1.6 DAS specification.
Availability: Dasty3 is an open-source tool freely available at http://www.ebi.ac.uk/dasty/ under the terms of the GNU General public license. Source and documentation can be found at http://code.google.com/p/dasty/.
In proteomics, protein identifications are reported and stored using an unstable reference system: protein identifiers. These proprietary identifiers are created individually by every protein database and can change or may even be deleted over time.
To estimate the effect of the searched protein sequence database on the long-term storage of proteomics data we analyzed the changes of reported protein identifiers from all public experiments in the Proteomics Identifications (PRIDE) database by November 2010. To map the submitted protein identifier to a currently active entry, two distinct approaches were used. The first approach used the Protein Identifier Cross Referencing (PICR) service at the EBI, which maps protein identifiers based on 100% sequence identity. The second one (called logical mapping algorithm) accessed the source databases and retrieved the current status of the reported identifier.
Our analysis showed the differences between the main protein databases (International Protein Index (IPI), UniProt Knowledgebase (UniProtKB), National Center for Biotechnological Information nr database (NCBI nr), and Ensembl) in respect to identifier stability. For example, whereas 20% of submitted IPI entries were deleted after two years, virtually all UniProtKB entries remained either active or replaced. Furthermore, the two mapping algorithms produced markedly different results. For example, the PICR service reported 10% more IPI entries deleted compared with the logical mapping algorithm. We found several cases where experiments contained more than 10% deleted identifiers already at the time of publication. We also assessed the proportion of peptide identifications in these data sets that still fitted the originally identified protein sequences. Finally, we performed the same overall analysis on all records from IPI, Ensembl, and UniProtKB: two releases per year were used, from 2005. This analysis showed for the first time the true effect of changing protein identifiers on proteomics data. Based on these findings, UniProtKB seems the best database for applications that rely on the long-term storage of proteomics data.
Centralised resources such as GenBank and UniProt are perfect examples of the major international efforts that have been made to integrate and share biological information. However, additional data that adds value to these resources needs a simple and rapid route to public access. The Distributed Annotation System (DAS) provides an adequate environment to integrate genomic and proteomic information from multiple sources, making this information accessible to the community. DAS offers a way to distribute and access information but it does not provide domain experts with the mechanisms to participate in the curation process of the available biological entities and their annotations.
We designed and developed a Collaborative Annotation System for proteins called DAS Writeback. DAS writeback is a protocol extension of DAS to provide the functionalities of adding, editing and deleting annotations. We implemented this new specification as extensions of both a DAS server and a DAS client. The architecture was designed with the involvement of the DAS community and it was improved after performing usability experiments emulating a real annotation task.
We demonstrate that DAS Writeback is effective, usable and will provide the appropriate environment for the creation and evolution of community protein annotation.
The Distributed Annotation System (DAS) has proven to be a successful way to publish and share biological data. Although there are more than 750 active registered servers from around 50 organizations, setting up a DAS server comprises a fair amount of work, making it difficult for many research groups to share their biological annotations. Given the clear advantage that the generalized sharing of relevant biological data is for the research community it would be desirable to facilitate the sharing process.
Here we present easyDAS, a web-based system enabling anyone to publish biological annotations with just some clicks. The system, available at http://www.ebi.ac.uk/panda-srv/easydas is capable of reading different standard data file formats, process the data and create a new publicly available DAS source in a completely automated way. The created sources are hosted on the EBI systems and can take advantage of its high storage capacity and network connection, freeing the data provider from any network management work. easyDAS is an open source project under the GNU LGPL license.
easyDAS is an automated DAS source creation system which can help many researchers in sharing their biological data, potentially increasing the amount of relevant biological data available to the scientific community.
This report summarizes the proceedings of the second workshop of the ‘Minimum Information for Biological and Biomedical Investigations’ (MIBBI) consortium held on Dec 1-2, 2010 in Rüdesheim, Germany through the sponsorship of the Beilstein-Institute. MIBBI is an umbrella organization uniting communities developing Minimum Information (MI) checklists to standardize the description of data sets, the workflows by which they were generated and the scientific context for the work. This workshop brought together representatives of more than twenty communities to present the status of their MI checklists and plans for future development. Shared challenges and solutions were identified and the role of MIBBI in MI checklist development was discussed. The meeting featured some thirty presentations, wide-ranging discussions and breakout groups. The top outcomes of the two-day workshop as defined by the participants were: 1) the chance to share best practices and to identify areas of synergy; 2) defining a series of tasks for updating the MIBBI Portal; 3) reemphasizing the need to maintain independent MI checklists for various communities while leveraging common terms and workflow elements contained in multiple checklists; and 4) revision of the concept of the MIBBI Foundry to focus on the creation of a core set of MIBBI modules intended for reuse by individual MI checklist projects while maintaining the integrity of each MI project. Further information about MIBBI and its range of activities can be found at http://mibbi.org/.
Reactome (http://www.reactome.org) is a collaboration among groups at the Ontario Institute for Cancer Research, Cold Spring Harbor Laboratory, New York University School of Medicine and The European Bioinformatics Institute, to develop an open source curated bioinformatics database of human pathways and reactions. Recently, we developed a new web site with improved tools for pathway browsing and data analysis. The Pathway Browser is an Systems Biology Graphical Notation (SBGN)-based visualization system that supports zooming, scrolling and event highlighting. It exploits PSIQUIC web services to overlay our curated pathways with molecular interaction data from the Reactome Functional Interaction Network and external interaction databases such as IntAct, BioGRID, ChEMBL, iRefIndex, MINT and STRING. Our Pathway and Expression Analysis tools enable ID mapping, pathway assignment and overrepresentation analysis of user-supplied data sets. To support pathway annotation and analysis in other species, we continue to make orthology-based inferences of pathways in non-human species, applying Ensembl Compara to identify orthologs of curated human proteins in each of 20 other species. The resulting inferred pathway sets can be browsed and analyzed with our Species Comparison tool. Collaborations are also underway to create manually curated data sets on the Reactome framework for chicken, Drosophila and rice.
Mass spectrometry is a fundamental tool for discovery and analysis in the life sciences. With the rapid advances in mass spectrometry technology and methods, it has become imperative to provide a standard output format for mass spectrometry data that will facilitate data sharing and analysis. Initially, the efforts to develop a standard format for mass spectrometry data resulted in multiple formats, each designed with a different underlying philosophy. To resolve the issues associated with having multiple formats, vendors, researchers, and software developers convened under the banner of the HUPO PSI to develop a single standard. The new data format incorporated many of the desirable technical attributes from the previous data formats, while adding a number of improvements, including features such as a controlled vocabulary with validation tools to ensure consistent usage of the format, improved support for selected reaction monitoring data, and immediately available implementations to facilitate rapid adoption by the community. The resulting standard data format, mzML, is a well tested open-source format for mass spectrometer output files that can be readily utilized by the community and easily adapted for incremental advances in mass spectrometry technology.
The advent of sequencing and structural genomics projects has provided a dramatic boost in the number of protein structures and sequences. Due to the high-throughput nature of these projects, many of the molecules are uncharacterised and their functions unknown. This, in turn, has led to the need for a greater number and diversity of tools and databases providing annotation through transfer based on homology and prediction methods. Though many such tools to annotate protein sequence and structure exist, they are spread throughout the world, often with dedicated individual web pages. This situation does not provide a consensus view of the data and hinders comparison between methods. Integration of these methods is needed. So far this has not been possible since there was no common vocabulary available that could be used as a standard language. A variety of terms could be used to describe any particular feature ranging from different spellings to completely different terms. The Protein Feature Ontology (http://www.ebi.ac.uk/ontology-lookup/browse.do?ontName=BS) is a structured controlled vocabulary for features of a protein sequence or structure. It provides a common language for tools and methods to use, so that integration and comparison of their annotations is possible. The Protein Feature Ontology comprises approximately 100 positional terms (located in a particular region of the sequence), which have been integrated into the Sequence Ontology (SO). 40 non-positional terms which describe general protein properties have also been defined and, in addition, post-translational modifications are described by using an already existing ontology, the Protein Modification Ontology (MOD). The Protein Feature Ontology has been used by the BioSapiens Network of Excellence, a consortium comprising 19 partner sites in 14 European countries generating over 150 distinct annotation types for protein sequences and structures.