The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact journals.permissions/at/oxfordjournals.org
PRIDE, the ‘PRoteomics IDEntifications database’ (http://www.ebi.ac.uk/pride) is a database of protein and peptide identifications that have been described in the scientific literature. These identifications will typically be from specific species, tissues and sub-cellular locations, perhaps under specific disease conditions. Any post-translational modifications that have been identified on individual peptides can be described. These identifications may be annotated with supporting mass spectra. At the time of writing, PRIDE includes the full set of identifications as submitted by individual laboratories participating in the HUPO Plasma Proteome Project and a profile of the human platelet proteome submitted by the University of Ghent in Belgium. By late 2005 PRIDE is expected to contain the identifications and spectra generated by the HUPO Brain Proteome Project. Proteomics laboratories are encouraged to submit their identifications and spectra to PRIDE to support their manuscript submissions to proteomics journals. Data can be submitted in PRIDE XML format if identifications are included or mzData format if the submitter is depositing mass spectra without identifications. PRIDE is a web application, so submission, searching and data retrieval can all be performed using an internet browser. PRIDE can be searched by experiment accession number, protein accession number, literature reference and sample parameters including species, tissue, sub-cellular location and disease state. Data can be retrieved as machine-readable PRIDE or mzData XML (the latter for mass spectra without identifications), or as human-readable HTML.
The vast quantity of data associated with a single proteomics experiment can become problematic at the point of publishing the results. Laboratories tend to publish their work in an appropriate journal with perhaps a PDF document listing the proteins described. If space allows, the individual peptide sequences may be included but there is little possibility of including details of the mass spectra in this format. This clearly creates difficulties when others attempt to reproduce a laboratory's work to confirm its results.
Fortunately, the community has recognized this problem and is tackling it through the formation of groups concerned with the development of standards for the capture and sharing of proteomics data. One such group is the HUPO Proteomics Standards Initiative (PSI) (1), which is developing standards that address several aspects of proteomics, including ontologies of proteomics-related terms, XML schemata and minimal reporting guidelines.
The Proteomics Identifications database (PRIDE), previously described by Martens et al. (2), is a PSI-compliant public repository for proteomics identifications to which any proteomics laboratory is welcome to submit data. It is envisaged, but not mandated, that any such submission would normally accompany the submission to a journal of a manuscript describing the identifications submitted to PRIDE. As such, PRIDE aims to become the proteomics equivalent of the ArrayExpress database (3) used to capture microarray experiment data in support of journal publications.
PRIDE is not alone in this endeavor. Several other publicly available databases exist for the purpose of capturing and disseminating proteomics data from mass spectrometry. Such databases include the Global Proteome Machine Database (gpmDB) (4), The Institute for Systems Biology's PeptideAtlas (5) and the University of Texas' Open Proteomics Database (opd) (6). A collaborative agreement to exchange data between these and other emerging proteomics data repositories, including PRIDE, is currently being developed.
PRIDE can store
PRIDE version 2.0, the release of PRIDE available at the time of writing, makes use of HUPO PSI deliverables such as the mzData XML schema (7) for capturing the settings and output from the mass spectrometry workflow, including items vi–viii listed earlier. At present, the PRIDE XML schema encompasses the mzData schema with additional elements to allow protein and peptide identifications and post-translational modifications to be captured. It is envisaged that the analysisXML XML schema will be incorporated into PRIDE following its first release as a finalized schema, expected by early 2006, replacing large parts of the custom schema currently present in PRIDE.
A significant dataset that is publicly available from PRIDE at the time of writing is the set of protein and peptide identifications from the individual laboratories involved in the HUPO Plasma Proteome Project (8). This project was in part responsible for the requirements statement that initiated the PRIDE project.
Another publicly available dataset in PRIDE is a profile of the human platelet proteome (9) submitted by the Department of Medical Protein Research, Ghent University. This department is also scheduled to contribute a substantial dataset identifying proteolytic cleavage by caspases in apoptotic Jurkat T-cells (10) as well as a large set of spectra used to evaluate spectrum quality filtering software (11).
A dataset of protein and peptide identifications describing the organelle proteome of the secretory pathway is currently held as private data in PRIDE but is expected to be publicly available following publication of the related manuscript. At present this dataset can only be viewed by prior permission of the submitters.
It is expected that by the end of 2005 PRIDE will also contain the protein and peptide identifications and related mass spectra from the HUPO Brain Proteome Project (12) as a publicly available dataset.
Data can be both submitted to and retrieved from PRIDE through a web interface, using either the PRIDE XML schema, which embeds mzData as a sub-element to allow inclusion of details of the spectra, or using the mzData XML schema, in which case all identifications will be omitted.
Data can also be viewed as a human-readable HTML table illustrated in Figure 1.
Figure 2 illustrates the search page. Queries can include experiment identifier, protein accession or identifier, literature references and sample parameters, including species, tissue, sub-cellular location and disease. The search results include all the experiments that match the query, together with options of how the data should be presented.
Data submitted to PRIDE is marked as public or private. Private data can be shared through a collaborative mechanism that allows individuals to apply to join a collaboration, their application then being confirmed or rejected by the creator of the collaboration. As well as allowing collaborating laboratories to share their data, this mechanism can also be used to allow manuscript reviewers to access the corresponding PRIDE entry in a confidential manner on a neutral site.
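The collaboration mechanism described above can be sketched as a small access-control model. This is only an illustration under stated assumptions: the class names, the `apply_to_join`, `decide` and `can_view` functions, and the example accession are hypothetical and are not taken from the PRIDE code base.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class Collaboration:
    """Hypothetical model of a PRIDE-style collaboration."""
    creator: str
    members: Set[str] = field(default_factory=set)
    pending: Set[str] = field(default_factory=set)

    def apply_to_join(self, user: str) -> None:
        # An individual applies; nothing is granted until the creator decides.
        if user != self.creator and user not in self.members:
            self.pending.add(user)

    def decide(self, acting_user: str, applicant: str, accept: bool) -> None:
        # Only the creator of the collaboration may confirm or reject.
        if acting_user != self.creator:
            raise PermissionError("only the creator can decide applications")
        if applicant not in self.pending:
            raise ValueError("no pending application for this user")
        self.pending.remove(applicant)
        if accept:
            self.members.add(applicant)


@dataclass
class Experiment:
    accession: str  # placeholder identifier, not a real PRIDE accession
    public: bool
    collaboration: Optional[Collaboration] = None


def can_view(user: str, experiment: Experiment) -> bool:
    """Public data is visible to all; private data only to the collaboration."""
    if experiment.public:
        return True
    collab = experiment.collaboration
    return collab is not None and (user == collab.creator or user in collab.members)


collab = Collaboration(creator="submitter")
private_exp = Experiment("EXAMPLE-1", public=False, collaboration=collab)
collab.apply_to_join("reviewer")
print(can_view("reviewer", private_exp))  # application still pending
collab.decide("submitter", "reviewer", accept=True)
print(can_view("reviewer", private_exp))  # confirmed by the creator
```

In this sketch a manuscript reviewer is treated like any other applicant, which mirrors the confidential reviewer access described in the text.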
By extending the mechanism designed for the mzData XML schema, PRIDE makes extensive use of external controlled vocabularies and ontologies (hereafter ‘CVs’) to annotate entries. The use of CVs ensures that queries for particular terms will capture all of the relevant data without omission due to differences in terminology. As a spin-off of the PRIDE development program, a SOAP web service to allow external CVs to be queried in an intelligent manner has been developed at the EBI, initially for use by PRIDE (http://www.ebi.ac.uk/ontology-lookup/). This service allows queries to take advantage of the hierarchical nature of ontologies. For example, if a user requests all protein identifications found in pancreas, the relevant Medical Subject Headings (MeSH) term will be looked up in the ontology web service and PRIDE will be queried for entries relating to the MeSH term ‘Pancreas’ as well as all child terms, currently in this case including ‘Islets of Langerhans’, ‘Pancreas, Exocrine’ and ‘Pancreatic Ducts’. This mechanism assists the user by retrieving all the relevant data without the need to have a detailed knowledge of the terms involved.
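The hierarchical query expansion described above can be illustrated with a minimal sketch. It assumes a hand-written, in-memory slice of the term hierarchy in place of the ontology web service; the `CHILDREN` and `EXPERIMENTS` tables, the experiment identifiers and both function names are hypothetical. Only the child terms of 'Pancreas' come from the text.

```python
from collections import deque

# Hypothetical in-memory slice of a term hierarchy (parent -> children).
# A real lookup would query the ontology service rather than a dict.
CHILDREN = {
    "Pancreas": ["Islets of Langerhans", "Pancreas, Exocrine", "Pancreatic Ducts"],
}

# Hypothetical experiment annotations: experiment id -> tissue term.
EXPERIMENTS = {
    "exp-1": "Pancreas",
    "exp-2": "Islets of Langerhans",
    "exp-3": "Liver",
    "exp-4": "Pancreatic Ducts",
}


def expand_term(term, children):
    """Return the term together with all of its descendants (breadth-first)."""
    seen = {term}
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for child in children.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def find_experiments(term, children, experiments):
    """Match experiments annotated with the term or any of its child terms."""
    terms = expand_term(term, children)
    return sorted(exp for exp, tissue in experiments.items() if tissue in terms)


print(find_experiments("Pancreas", CHILDREN, EXPERIMENTS))
# prints ['exp-1', 'exp-2', 'exp-4']
```

A query for 'Pancreas' therefore also returns experiments annotated only with a more specific term, without the user having to know those terms.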
CVs and ontologies suggested for use in PRIDE include MeSH (13) for animal anatomy and disease states; Gene Ontology (GO) (14) for sub-cellular location; NEWT (15) for taxonomy, which is a superset of the NCBI taxonomy (16); the mass spectrometry ontology being developed by the HUPO PSI; RESID for naturally occurring post-translational modifications (17) and UNIMOD for protein modifications encountered in mass spectrometry experiments (18).
A PRIDE CV has been created for cases where existing CVs do not include a term required to annotate data in PRIDE.
The use of CVs and ontologies will allow the annotation of certain specific experimental results such as peptide retention times for LC-MS experiments or protein quantitation information for quantitative or differential proteomics experiments. Where the required CV terms do not exist already, PRIDE can accommodate these data elements through the use of user parameters.
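The fallback from CV terms to user parameters might look like the following sketch. The `CvParam` and `UserParam` classes, their field names, the `annotate` function and the placeholder CV label and accession are all assumptions made for illustration; they are not the PRIDE schema or its real term identifiers.

```python
from dataclasses import dataclass
from typing import Dict, Tuple, Union


@dataclass(frozen=True)
class CvParam:
    """Annotation backed by a controlled vocabulary term (hypothetical shape)."""
    cv_label: str
    accession: str
    name: str
    value: str


@dataclass(frozen=True)
class UserParam:
    """Free-form annotation used when no CV term exists (hypothetical shape)."""
    name: str
    value: str


# Placeholder CV content: term name -> (CV label, term accession).
KNOWN_TERMS: Dict[str, Tuple[str, str]] = {
    "retention time": ("EXAMPLE_CV", "EXAMPLE:0001"),
}


def annotate(name: str, value, known_terms=KNOWN_TERMS) -> Union[CvParam, UserParam]:
    """Prefer a CV-backed parameter; fall back to a user parameter."""
    term = known_terms.get(name)
    if term is None:
        return UserParam(name=name, value=str(value))
    cv_label, accession = term
    return CvParam(cv_label=cv_label, accession=accession, name=name, value=str(value))


print(annotate("retention time", 1234.5))
print(annotate("custom quantitation ratio", 2.1))
```

Keeping the two kinds of parameter distinct means that free-form annotations can later be promoted to CV terms once suitable terms exist.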
Care has been taken throughout the development of PRIDE to ensure that all system components are open-source and freely available. PRIDE is written in Java and made available under the open-source Apache license. All of the source code is freely available from the CVS repository (http://sourceforge.net/cvs/?group_id=122040). PRIDE uses the open-source Object-Relational Bridge (OJB) (http://db.apache.org/ojb/) API for database connectivity. As a consequence, PRIDE can easily be adapted to run on any SQL-based relational database management system. Configuration files exist for both Oracle (http://www.oracle.com) and MySQL (http://www.mysql.com/).
Here we consider possible applications of PRIDE from the perspective of the typical proteomics researcher. PRIDE offers the user several useful query opportunities including:
The developers of PRIDE recognize that the system has room to evolve in several important aspects.
It is important for the future of PRIDE to keep pace with developments in the HUPO PSI. One important development of this initiative is the analysisXML XML schema, designed to hold details of protein and peptide identifications and post-translational modifications, together with cross references to the relevant mzData entries describing spectra. It is intended that analysisXML will be fully supported by PRIDE for import and export, without modification or data loss, as soon as possible after the first stable release of the new analysisXML XML schema.
Submitters of identifications to PRIDE will naturally make use of their favored protein sequence database against which to search their spectra. Consequently, PRIDE will quickly fill with protein accessions and IDs from disparate protein sequence databases. An important short-term goal of the PRIDE project is to map all identifications to the UniProt database (19), including cross references to as many other protein databases as possible. This work will borrow heavily from the IntAct project (20), both in terms of code base and procedures for automatic and human curation.
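The mapping goal can be pictured with a short sketch that groups identifications from different source databases under one target accession and sets aside anything that cannot be mapped automatically. The mapping table, database names and accessions below are placeholders, and `map_identifications` is a hypothetical helper; none of this reflects the actual PRIDE or IntAct procedures.

```python
from collections import defaultdict

# Placeholder mapping: (source database, source accession) -> target accession.
ACCESSION_MAP = {
    ("database_a", "A0001"): "UNIPROT_X",
    ("database_b", "B-77"): "UNIPROT_X",
    ("database_b", "B-78"): "UNIPROT_Y",
}


def map_identifications(identifications, accession_map):
    """Group submitted accessions by mapped target accession.

    Returns (mapped, unmapped): `mapped` keeps the original accessions as
    cross references, and `unmapped` lists entries left for human curation.
    """
    mapped = defaultdict(list)
    unmapped = []
    for source_db, accession in identifications:
        target = accession_map.get((source_db, accession))
        if target is None:
            unmapped.append((source_db, accession))
        else:
            mapped[target].append((source_db, accession))
    return dict(mapped), unmapped


submitted = [
    ("database_a", "A0001"),
    ("database_b", "B-77"),
    ("database_b", "B-78"),
    ("database_c", "C.9"),
]
mapped, unmapped = map_identifications(submitted, ACCESSION_MAP)
print(mapped)
print(unmapped)
```

Retaining the submitted accession alongside the mapped one preserves the cross references mentioned in the text.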
A long-term goal of the PRIDE project is to provide an automated program of regular re-analysis of mass spectra deposited in PRIDE using the most up-to-date protein sequence databases and available open-source search algorithms such as X!Tandem (http://www.thegpm.org/TANDEM/) (21). The submitter's original identifications would continue to be available as described in the corresponding manuscript.
The EBI has developed a Distributed Annotation Server (DAS) (22) service for PRIDE (http://www.ebi.ac.uk/das-srv/pride/das/) using the BioJava Dazzle servlet (http://www.biojava.org/dazzle/). This service is publicly available and can be used to enable DAS clients such as Dasty (23), designed for visualizing protein sequence and annotation, to display identifying peptides for the protein specified in the DAS request.
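As an illustration of how a DAS client might address this service, the sketch below builds a DAS `features` request URL for a given protein. It assumes the DAS 1.x convention of a `features` command with a `segment` parameter under a named data source. The data source name and protein identifier are placeholders, since the text does not state them, and no network request is made.

```python
from urllib.parse import urlencode

# Base URL as given in the text.
PRIDE_DAS_BASE = "http://www.ebi.ac.uk/das-srv/pride/das/"


def features_url(base: str, source_name: str, segment_id: str) -> str:
    """Build a DAS 'features' request URL for one segment (here, a protein).

    `source_name` is the DAS data source name; the value used below is a
    placeholder, not the real name of the PRIDE data source.
    """
    query = urlencode({"segment": segment_id})
    return f"{base.rstrip('/')}/{source_name}/features?{query}"


print(features_url(PRIDE_DAS_BASE, "example-source", "EXAMPLE_ACC"))
# prints http://www.ebi.ac.uk/das-srv/pride/das/example-source/features?segment=EXAMPLE_ACC
```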
PRIDE is supported through BBSRC iSPIDER and HUPO Plasma Proteome Project funding as well as an EU Marie Curie fellowship. L.M. is a research assistant of the Fund for Scientific Research, Flanders (Belgium) (F.W.O. Vlaanderen). L.M. would like to thank Prof Dr Kris Gevaert and Prof Dr Joël Vandekerckhove for their support. Funding to pay the Open Access publication charges for this article was provided by BBSRC iSPIDER and HUPO.
Conflict of interest statement. None declared.