What is the scope of PRIDE?
PRIDE can store:
- (i) The title and description of the experiment, together with contact details of the submitter.
- (ii) Literature references.
- (iii) Protein identifications by accession number, each supported by a corresponding list of one or more peptide identifications.
- (iv) For each peptide identified, the sequence and coordinates of the peptide within the protein that it provides evidence for. Optionally, a reference to any submitted mass spectra that form the evidence for the peptide identification.
- (v) Any post-translational modifications (natural or artefactual), located relative to the specific peptide on which they were found.
- (vi) A description of the sample under analysis, including but not limited to the species of origin, tissue, sub-cellular location (if appropriate), disease state and any other relevant annotation.
- (vii) A description of the instrumentation used to perform the analysis, including mass spectrometer source, analysers and detector, instrument settings and software settings used in data processing to generate peak lists.
- (viii) Processed peak lists supporting the identifications in PRIDE, in the PSI mzData format.
PRIDE version 2.0, the release of PRIDE available at the time of writing, makes use of HUPO PSI deliverables such as the mzData XML schema (7) for capturing the settings and output from the mass spectrometry workflow, including items vi–viii listed earlier. At present, the PRIDE XML schema encompasses the mzData schema with additional elements to allow protein and peptide identifications and post-translational modifications to be captured. It is envisaged that the analysisXML schema will be incorporated into PRIDE following its first release as a finalized schema, expected by early 2006, replacing large parts of the custom schema currently present in PRIDE.
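The relationships among the stored items can be pictured with a minimal Python sketch. It is not the PRIDE XML schema or the PRIDE Java object model. All class and field names are hypothetical, coordinates are assumed to be 1-based and inclusive, and the accession `P00000` and peptide `ELVISK` are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Modification:
    name: str      # modification name, e.g. taken from RESID or UNIMOD
    position: int  # assumed 1-based position within the peptide sequence


@dataclass
class PeptideIdentification:
    sequence: str
    start: int                          # assumed 1-based start within the protein
    end: int                            # assumed inclusive end within the protein
    spectrum_ref: Optional[int] = None  # optional link to a submitted spectrum
    modifications: List[Modification] = field(default_factory=list)

    def is_consistent(self) -> bool:
        """Coordinates must span the sequence; modifications must lie inside it."""
        if self.end - self.start + 1 != len(self.sequence):
            return False
        return all(1 <= m.position <= len(self.sequence) for m in self.modifications)


@dataclass
class ProteinIdentification:
    accession: str
    peptides: List[PeptideIdentification]

    def is_supported(self) -> bool:
        """A protein identification needs one or more consistent peptides."""
        return len(self.peptides) >= 1 and all(p.is_consistent() for p in self.peptides)


# Invented example data: one modified peptide supporting one protein.
peptide = PeptideIdentification(
    sequence="ELVISK",
    start=10,
    end=15,
    modifications=[Modification("Phosphorylation", 5)],
)
protein = ProteinIdentification("P00000", [peptide])
print(protein.is_supported())
```

Locating modifications relative to the peptide rather than the protein follows item v of the list; the consistency check is an illustrative addition, not a documented PRIDE validation rule.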
Datasets currently available in PRIDE
A significant dataset that is publicly available from PRIDE at the time of writing is the set of protein and peptide identifications from the individual laboratories involved in the HUPO Plasma Proteome Project (8). This project was in part responsible for the requirements statement that initiated the PRIDE project.
Another publicly available dataset in PRIDE is a profile of the human platelet proteome (9) submitted by the Department of Medical Protein Research, Ghent University. This department is also scheduled to contribute a substantial dataset identifying proteolytic cleavage by caspases in apoptotic Jurkat T-cells (10) as well as a large set of spectra used to evaluate spectrum quality filtering software (11).
A dataset of protein and peptide identifications describing the organelle proteome of the secretory pathway is currently held as private data in PRIDE but is expected to be publicly available following publication of the related manuscript. At present this dataset can only be viewed by prior permission of the submitters.
It is expected that by the end of 2005 PRIDE will also contain the protein and peptide identifications and related mass spectra from the HUPO Brain Proteome Project (12) as a publicly available dataset.
Submission and retrieval of data
Data can be both submitted to and retrieved from PRIDE through a web interface, using either the PRIDE XML schema, which embeds mzData as a sub-element to allow inclusion of details of the spectra, or using the mzData XML schema, in which case all identifications will be omitted.
Data can also be viewed as a human-readable HTML table, as illustrated in the accompanying figure.
An example of PRIDE data in tabulated HTML format.
The search page is illustrated in an accompanying figure. Queries can include experiment identifier, protein accession or identifier, literature references and sample parameters, including species, tissue, sub-cellular location and disease. The search results include all the experiments that match the query, together with options for how the data should be presented.
Data security in PRIDE: PRIDE as a tool for journal review
Data submitted to PRIDE is marked as public or private. Private data can be shared through a collaborative mechanism that allows individuals to apply to join a collaboration, their application then being confirmed or rejected by the creator of the collaboration. As well as allowing collaborating laboratories to share their data, this mechanism can also be used to allow manuscript reviewers to access the corresponding PRIDE entry in a confidential manner on a neutral site.
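The collaboration mechanism can be illustrated with a short Python sketch. It is an assumption-laden model, not PRIDE's actual implementation: the `Collaboration` class, its method names and the `can_view` helper are hypothetical, and users are represented by plain strings.

```python
class Collaboration:
    """Hypothetical model of a PRIDE collaboration (names are illustrative)."""

    def __init__(self, creator):
        self.creator = creator
        self.members = {creator}  # the creator always has access
        self.pending = set()      # applications awaiting a decision

    def apply(self, user):
        """An individual applies to join the collaboration."""
        if user not in self.members:
            self.pending.add(user)

    def decide(self, actor, user, accept):
        """Only the creator may confirm or reject an application."""
        if actor != self.creator:
            raise PermissionError("only the creator can confirm or reject")
        if user not in self.pending:
            raise ValueError("no pending application for this user")
        self.pending.discard(user)
        if accept:
            self.members.add(user)


def can_view(user, is_public, collaboration=None):
    """Public data is visible to all; private data only to collaboration members."""
    if is_public:
        return True
    return collaboration is not None and user in collaboration.members


# Invented usage: a reviewer gains confidential access to a private entry.
collab = Collaboration("submitter")
collab.apply("reviewer")
collab.decide("submitter", "reviewer", accept=True)
print(can_view("reviewer", is_public=False, collaboration=collab))
```

In the journal-review scenario, the submitter would create the collaboration and confirm the application made by the reviewer, so the private entry never has to be made public before publication.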
Use of controlled vocabularies and ontologies in PRIDE
By extending the mechanism designed for the mzData XML schema, PRIDE makes extensive use of external controlled vocabularies and ontologies (hereafter ‘CVs’) to annotate entries. The use of CVs ensures that queries for particular terms will capture all of the relevant data without omission due to differences in terminology. As a spin-off of the PRIDE development program, a SOAP web service to allow external CVs to be queried in an intelligent manner has been developed at the EBI, initially for use by PRIDE (http://www.ebi.ac.uk/ontology-lookup/). This service allows queries to take advantage of the hierarchical nature of ontologies. For example, if a user requests all protein identifications found in pancreas, the relevant Medical Subject Headings (MeSH) term will be looked up in the ontology web service and PRIDE will be queried for entries relating to the MeSH term ‘Pancreas’ as well as all child terms, currently in this case including ‘Islets of Langerhans’, ‘Pancreas, Exocrine’ and ‘Pancreatic Ducts’. This mechanism assists the user by retrieving all the relevant data without the need to have a detailed knowledge of the terms involved.
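The hierarchical query expansion described above can be sketched in Python as follows. The sketch uses a small in-memory parent-to-children map in place of the ontology web service, whose actual interface is not shown here. The function names, the experiment identifiers and the ‘Liver’ entry are assumptions made for illustration; only the ‘Pancreas’ child terms come from the text.

```python
from collections import deque

# Toy fragment of the hierarchy; a real lookup would go to the ontology service.
CHILDREN = {
    "Pancreas": ["Islets of Langerhans", "Pancreas, Exocrine", "Pancreatic Ducts"],
}

# Invented entries: (experiment identifier, annotated tissue term).
ENTRIES = [
    ("exp-1", "Pancreas"),
    ("exp-2", "Islets of Langerhans"),
    ("exp-3", "Liver"),
]


def expand_term(term, children=CHILDREN):
    """Return the term plus all of its descendants (breadth-first)."""
    ordered = [term]
    seen = {term}
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for child in children.get(current, []):
            if child not in seen:  # guard against repeated or cyclic terms
                seen.add(child)
                ordered.append(child)
                queue.append(child)
    return ordered


def query_by_tissue(term, entries=ENTRIES):
    """Select experiments annotated with the term or any of its child terms."""
    terms = set(expand_term(term))
    return [experiment for experiment, tissue in entries if tissue in terms]


print(query_by_tissue("Pancreas"))
```

A query for ‘Pancreas’ therefore also returns the experiment annotated with ‘Islets of Langerhans’, whereas a query for the child term alone would not return experiments annotated only with the parent.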
CVs and ontologies suggested for use in PRIDE include MeSH (13) for animal anatomy and disease states; Gene Ontology (GO) (14) for sub-cellular location; NEWT (15) for taxonomy, which is a superset of the NCBI taxonomy (16); the mass spectrometry ontology being developed by the HUPO PSI; RESID for naturally occurring post-translational modifications (17) and UNIMOD for protein modifications encountered in mass spectrometry experiments (18).
A PRIDE CV has been created for cases where existing CVs do not include a term required to annotate data in PRIDE.
The use of CVs and ontologies will allow the annotation of certain specific experimental results such as peptide retention times for LC-MS experiments or protein quantitation information for quantitative or differential proteomics experiments. Where the required CV terms do not exist already, PRIDE can accommodate these data elements through the use of user parameters.
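The fallback from CV terms to user parameters can be sketched in Python as below. The sketch loosely mirrors the distinction between CV parameters and user parameters in mzData, but the `Param` class, the `annotate` function and the lookup table are hypothetical, and the accession `PRIDE:0000000` is a placeholder rather than a real term.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Param:
    name: str
    value: str
    cv_label: Optional[str] = None   # None marks a free-text user parameter
    accession: Optional[str] = None

    @property
    def is_cv_param(self) -> bool:
        return self.cv_label is not None


def annotate(name: str, value: str, cv_terms: Dict[str, Tuple[str, str]]) -> Param:
    """Use a CV term when one exists for the name, else fall back to a user parameter."""
    if name in cv_terms:
        cv_label, accession = cv_terms[name]
        return Param(name, value, cv_label, accession)
    return Param(name, value)


# Placeholder lookup table: the accession below is invented for illustration.
KNOWN_TERMS = {"retention time": ("PRIDE", "PRIDE:0000000")}

rt = annotate("retention time", "1532.4", KNOWN_TERMS)
ratio = annotate("heavy/light ratio", "2.1", KNOWN_TERMS)
print(rt.is_cv_param, ratio.is_cv_param)
```

Annotations backed by a CV term remain searchable by accession, while user parameters carry only a free-text name and value until a suitable term is added to a CV.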
PRIDE is an open-source software development project
Care has been taken throughout the development of PRIDE to ensure that all system components are open-source and freely available. PRIDE is written in Java and made available under the open-source Apache license. All the source code is freely available from the CVS repository (http://sourceforge.net/cvs/?group_id=122040). PRIDE uses the open-source Object-Relational Bridge (OJB) (http://db.apache.org/ojb/) API for database connectivity. As a consequence, PRIDE can easily be adapted to run on any SQL-based relational database management system. Configuration files exist for both Oracle (http://www.oracle.com) and MySQL (http://www.mysql.com/).