Proteomic techniques allow researchers to perform detailed analyses of cellular states, and many studies published each year highlight large numbers of proteins quantified in different samples. However, few data sets currently make it into public databases with sufficient metadata to allow other groups to verify findings, perform data mining or integrate different data sets. The Proteomics Standards Initiative has released a series of "Minimum Information About a Proteomics Experiment" guideline documents (MIAPE modules) and accompanying data exchange formats. This article focuses on proteomic studies based on gel electrophoresis and demonstrates how the corresponding MIAPE modules can be fulfilled and data deposited in public databases, using a new experimental data set as an example.
We have performed a study of the effects of an anabolic agent (salbutamol) at two different time points on the protein complement of rat skeletal muscle cells, quantified by difference gel electrophoresis. In the DIGE study, a total of 31 non-redundant proteins were identified as potentially modulated at 24 h post-treatment and 110 non-redundant proteins at 96 h post-treatment. Several functional categories were highlighted as strongly enriched, providing candidate proteins for further study. We also use the study as an example of best practice for data deposition.
We have deposited all data sets from this study in public databases for further analysis by the community. We also describe more generally how gel-based protein identification data sets can now be deposited in the PRoteomics IDEntifications database (PRIDE), using a new software tool, the PRIDESpotMapper, which we developed to work in conjunction with the PRIDE Converter application. We also demonstrate how the ProteoRed MIAPE generator tool can be used to create and share a complete and compliant set of MIAPE reports for this experiment and others.
Despite the complete determination of the genome sequences of a huge number of bacteria, their proteomes remain relatively poorly defined. Besides new methods to increase the number of identified proteins, new database applications are necessary to store and present the results of large-scale proteomics experiments.
In the present study, a database concept has been developed to address these issues and to offer complete information via a web interface. In our concept, the Oracle-based data repository system SQL-LIMS plays the central role in the proteomics workflow and was applied to the proteomes of Mycobacterium tuberculosis, Helicobacter pylori, Salmonella typhimurium and protein complexes such as the 20S proteasome. Technical operations of our proteomics labs were used as the standard for SQL-LIMS template creation. By means of a Java-based data parser, post-processed data from different approaches, such as LC/ESI-MS, MALDI-MS and 2-D gel electrophoresis (2-DE), were stored in SQL-LIMS. A minimum set of the proteomics data were transferred into our public 2D-PAGE database using a Java-based interface (Data Transfer Tool), following the requirements of the PEDRo standardization. Furthermore, the stored proteomics data could be extracted from SQL-LIMS via XML.
The Oracle-based data repository system SQL-LIMS played the central role in the proteomics workflow concept. Technical operations of our proteomics labs were used as standards for SQL-LIMS templates. Using a Java-based parser, post-processed data from different approaches, such as LC/ESI-MS, MALDI-MS, 1-DE and 2-DE, were stored in SQL-LIMS. Thus, the disparate data formats of different instruments were unified and stored in SQL-LIMS tables. Moreover, a unique submission identifier allowed fast access to all experimental data. This was the main advantage compared with multi-software solutions, especially where personnel fluctuation is high. Furthermore, large-scale and high-throughput experiments must be managed in a comprehensive repository system such as SQL-LIMS so that results can be queried in a systematic manner. On the other hand, such database systems are expensive and require at least one full-time administrator and a specialized lab manager. In addition, the rapid technical evolution of proteomics can make it difficult to accommodate new data formats. To summarize, SQL-LIMS met the requirements of proteomics data handling, especially in skilled processes such as gel electrophoresis and mass spectrometry, and fulfilled the PSI standardization criteria. The data transfer into the public domain via the DTT facilitated validation of proteomics data. Additionally, evaluation of mass spectra by post-processing with MS-Screener improved the reliability of mass analysis and prevented the storage of junk data.
This paper focuses on the use of controlled vocabularies (CVs) and ontologies, especially in the area of proteomics, primarily related to the work of the Proteomics Standards Initiative (PSI). It describes the relevant proteomics standard formats and the ontologies used within them. Software and tools for working with these ontology files are also discussed. The article also examines the "mapping files" used to ensure that correct controlled vocabulary terms are placed within PSI standards and that the MIAPE (Minimum Information About a Proteomics Experiment) requirements are fulfilled. This article is part of a Special Issue entitled: Computational Proteomics in the Post-Identification Era. Guest Editors: Martin Eisenacher and Christian Stephan.
► The semantic annotation using ontologies is a prerequisite for the semantic web. ► The HUPO-PSI defined a set of XML-based standard formats for proteomics. ► These standard formats allow the referencing of CV terms defined in obo files. ► The CV terms can be used to enforce MIAPE compliance of the data files. ► The mass spectrometry CV is constantly maintained in a community process.
ANDI-MS, Analytical Data Interchange format for Mass Spectrometry; AniML, Analytical Information Markup Language; API, Application Programming Interface; ASCII, American Standard Code for Information Interchange; ASTM, American Society for Testing and Materials; BTO, BRENDA (BRaunschweig ENzyme DAtabase) Tissue Ontology; ChEBI, Chemical Entities of Biological Interest; CV, Controlled Vocabulary; DL, Description Logic; EBI, European Bioinformatics Institute; HDF5, Hierarchical Data Format, version 5; HUPO-PSI, Human Proteome Organisation-Proteomics Standards Initiative; ICD, International Classification of Diseases; IUPAC, International Union for Pure and Applied Chemistry; JCAMP-DX, Joint Committee on Atomic and Molecular Physical data-Data eXchange format; MALDI, Matrix Assisted Laser Desorption Ionization; MeSH, Medical Subject Headings; MI, Molecular Interaction; MIBBI, Minimal Information for Biological and Biomedical Investigations; MITAB, Molecular Interactions TABular format; MIAPE, Minimum Information About a Proteomics Experiment; MS, Mass Spectrometry; NCBI, National Center for Biotechnology Information; NCBO, National Center for Biomedical Ontology; netCDF, Network Common Data Format; OBI, Ontology for Biomedical Investigations; OBO, Open Biological and Biomedical Ontologies; OLS, Ontology Lookup Service; OWL, Web Ontology Language; PAR, Protein Affinity Reagents; PATO, Phenotype Attribute Trait Ontology; PRIDE, PRoteomics IDEntifications database; RDF(S), Resource Description Framework (Schema); SRM, Selected Reaction Monitoring; TPP, Trans-Proteomic Pipeline; URI, Uniform Resource Identifier; XSLT, eXtensible Stylesheet Language Transformation; YAFMS, Yet Another Format for Mass Spectrometry; Proteomics data standards; Controlled vocabularies; Ontologies in proteomics; Ontology formats; Ontology editors and software; Ontology maintenance
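As a brief illustration of how lightweight the OBO files referenced by the PSI standards are to consume (a sketch, not part of the original article; the file path and term IDs in the usage example are only illustrative):

```python
def parse_obo_terms(path):
    """Parse the stanzas of an OBO flat file into an {id: name} dict.

    Minimal sketch: it reads every id/name pair, so [Typedef] stanzas
    are collected alongside [Term] stanzas; a real CV reader would
    also handle is_a, synonyms, and obsolete flags.
    """
    terms = {}
    current_id = None
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith("["):          # new stanza begins
                current_id = None
            elif line.startswith("id:"):
                current_id = line[3:].strip()
            elif line.startswith("name:") and current_id:
                terms[current_id] = line[5:].strip()
    return terms
```

A lookup such as `parse_obo_terms("psi-ms.obo").get("MS:1000031")` would then return the human-readable name to display next to a `cvParam` accession.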
Molecular interaction information is a key resource in modern biomedical research. Publicly available data have previously been provided in a broad array of diverse formats, making access to them very difficult. The publication and wide implementation of the Human Proteome Organisation Proteomics Standards Initiative Molecular Interactions (HUPO PSI-MI) format in 2004 was a major step towards the establishment of a single, unified format by which molecular interactions should be presented, but it focused purely on protein-protein interactions.
The HUPO-PSI has further developed the PSI-MI XML schema to enable the description of interactions between a wider range of molecular types, for example nucleic acids, chemical entities, and molecular complexes. Extensive details about each supported molecular interaction can now be captured, including the biological role of each molecule within that interaction, detailed description of interacting domains, and the kinetic parameters of the interaction. The format is supported by data management and analysis tools and has been adopted by major interaction data providers. Additionally, a simpler, tab-delimited format MITAB2.5 has been developed for the benefit of users who require only minimal information in an easy to access configuration.
The PSI-MI XML2.5 and MITAB2.5 formats have been jointly developed by interaction data producers and providers from both the academic and commercial sector, and are already widely implemented and well supported by an active development community. PSI-MI XML2.5 enables the description of highly detailed molecular interaction data and facilitates data exchange between databases and users without loss of information. MITAB2.5 is a simpler format appropriate for fast Perl parsing or loading into Microsoft Excel.
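To show how simple MITAB2.5 is to consume programmatically, the following sketch splits one row into named fields. The column names below are shorthand labels for the 15 MITAB2.5 columns, not identifiers defined by the specification itself:

```python
# Shorthand labels for the 15 tab-separated MITAB2.5 columns
# (identifiers A/B, alternatives, aliases, detection method, author,
# publication, organisms, interaction type, source, IDs, confidence).
MITAB_COLUMNS = [
    "id_a", "id_b", "alt_id_a", "alt_id_b", "alias_a", "alias_b",
    "detection_method", "first_author", "publication", "taxid_a",
    "taxid_b", "interaction_type", "source_db", "interaction_id",
    "confidence",
]

def parse_mitab_line(line):
    """Split one MITAB2.5 data row into a dict keyed by column label."""
    fields = line.rstrip("\n").split("\t")
    return dict(zip(MITAB_COLUMNS, fields))
```

Because every field is plain text and missing values are simply `-`, the same file loads directly into Excel or a one-line Perl/awk pipeline, which is exactly the use case the format targets.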
Many proteomics initiatives require the integration of all information, with uniform criteria, from sample collection and data display to publication of experimental results. Integrating and exchanging these data of different formats and structures poses a great challenge. XML technology holds promise for handling this task owing to its simplicity and flexibility. Nasopharyngeal carcinoma (NPC) is one of the most common cancers in southern China and Southeast Asia, with marked geographic and racial differences in incidence. Although some cancer proteome databases now exist, there is still no NPC proteome database.
The raw NPC proteome experiment data were captured into one XML document with Human Proteome Markup Language (HUP-ML) editor and imported into native XML database Xindice. The 2D/MS repository of NPC proteome was constructed with Apache, PHP and Xindice to provide access to the database via Internet. On our website, two methods, keyword query and click query, were provided at the same time to access the entries of the NPC proteome database.
Our 2D/MS repository can be used to share the raw NPC proteomics data that are generated from gel-based proteomics experiments. The database, as well as the PHP source codes for constructing users' own proteome repository, can be accessed at .
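As a hedged sketch of what such a keyword query over gel-spot XML might look like (using Python's standard ElementTree in place of Xindice/XPath; the element names are illustrative, not the literal HUP-ML schema):

```python
import xml.etree.ElementTree as ET

def keyword_query(xml_text, keyword):
    """Return spot elements whose protein name contains the keyword.

    Illustrative only: real HUP-ML documents use their own element
    names and would be queried server-side via XPath in Xindice.
    """
    root = ET.fromstring(xml_text)
    keyword = keyword.lower()
    return [spot for spot in root.iter("spot")
            if keyword in (spot.findtext("protein_name") or "").lower()]
```

The "click query" mode described above would follow the same idea, with the spot identifier taken from the clicked gel-image coordinate instead of a typed keyword.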
The global analysis of proteins is now feasible due to improvements in techniques such as two-dimensional gel electrophoresis (2-DE), mass spectrometry, yeast two-hybrid systems and the development of bioinformatics applications. The experiments form the basis of proteomics, and present significant challenges in data analysis, storage and querying. We argue that a standard format for proteome data is required to enable the storage, exchange and subsequent re-analysis of large datasets. We describe the criteria that must be met for the development of a standard for proteomics. We have developed a model to represent data from 2-DE experiments, including difference gel electrophoresis along with image analysis and statistical analysis across multiple gels. This part of proteomics analysis is not represented in current proposals for proteomics standards. We are working with the Proteomics Standards Initiative to develop a model encompassing biological sample origin, experimental protocols, a number of separation techniques and mass spectrometry. The standard format will facilitate the development of central repositories of data, enabling results to be verified or re-analysed, and the correlation of results produced by different research groups using a variety of laboratory techniques.
The Human Proteome Organisation Proteomics Standards Initiative (HUPO-PSI) was established in 2002 with the aim of defining community standards for data representation in proteomics and facilitating data comparison, exchange and verification. Over the last 10 years significant advances have been made, with common data standards now published and implemented in the field of both mass spectrometry and molecular interactions. The 2012 meeting further advanced this work, with the mass spectrometry groups finalising approaches to capturing the output from recent developments in the field, such as quantitative proteomics and SRM. The molecular interaction group focused on improving the integration of data from multiple resources. Both groups united with a guest work track, organized by the HUPO Technology/Standards Committee, to formulate proposals for data submissions from the HUPO Human Proteome Project and to start an initiative to collect standard experimental protocols.
The Human Proteome Organisation — Proteomics Standards Initiative (HUPO-PSI) has been working for ten years on the development of standardised formats that facilitate data sharing and public database deposition. In this article, we review three HUPO-PSI data standards — mzML, mzIdentML and mzQuantML, which can be used to design a complete quantitative analysis pipeline in mass spectrometry (MS)-based proteomics. In this tutorial, we briefly describe the content of each data model, sufficient for bioinformaticians to devise proteomics software. We also provide guidance on the use of recently released application programming interfaces (APIs) developed in Java for each of these standards, which makes it straightforward to read and write files of any size. We have produced a set of example Java classes and a basic graphical user interface to demonstrate how to use the most important parts of the PSI standards, available from http://code.google.com/p/psi-standard-formats-tutorial. This article is part of a Special Issue entitled: Computational Proteomics in the Post-Identification Era. Guest Editors: Martin Eisenacher and Christian Stephan.
► A tutorial to help software developers use PSI standard formats. ► A description of programming interfaces and tools available. ► Code snippets and a basic graphical interface to assist understanding.
Quantitative proteomics; Software; Standard formats; APIs
The ProteoRed multicentric experiment was designed to test each laboratory's ability to perform quantitative proteomic analysis, to compare methodologies and inter-laboratory reproducibility for relative quantitative analysis of proteomes, and to evaluate data reporting and data sharing tools (MIAPE documents, standard formats, public repositories). The experiment consists of analyzing two different samples (A and B), which contain an identical matrix of E. coli proteins plus four standard proteins (CYC_HORSE, MYG_HORSE, ALDOA_RABIT, BSA_BOVIN) spiked in different amounts. The samples are designed primarily for analysis by LC-MS, although DIGE analysis is also possible. Each lab can test its preferred method for quantitative comparison of the two samples. However, to achieve as much standardization and reproducibility as possible in terms of data analysis, data sharing, protocol information, results and reporting, we propose the OmicsHub Proteomics server as the single platform to integrate the protein identification steps of the MS multicentric experiment and to serve as a web repository. After the "in-lab" analysis is performed with the laboratory's own tools, each lab can load its experiment (protocols, parameters, instruments, etc.) and import its spectrum data via the web into the OmicsHub Proteomics analysis and management server. Every experiment in OmicsHub is automatically stored following the PRIDE standard format. The OmicsHub Proteomics software performs the workflow tasks of protein identification (using the search engines Mascot and Phenyx), protein annotation, protein grouping, FDR filtering (allowing the use of a local decoy database designed ad hoc for this experiment) and reporting of the protein identification results in a systematic and centralized manner.
OmicsHub Proteomics allows researchers in the ProteoRed consortium to perform the multicentric study with full reproducibility, standardization and experiment comparison, reducing time and data management complexity prior to the final quantification analysis.
In today’s proteomics research, various techniques, instrumentation and bioinformatics tools are necessary to manage the large amount of heterogeneous data, with automatic quality control, to produce reliable and comparable results. A data-processing pipeline is therefore mandatory for data validation and comparison in a data-warehousing system. The proteome bioinformatics platform ProteinScape has proven to cover these needs. The reprocessing of HUPO BPP participants’ MS data was done within ProteinScape, and the reprocessed information was transferred into the global data repository PRIDE.
ProteinScape as a data-warehousing system covers two main aspects: archiving relevant data of the proteomics workflow and information extraction functionality (protein identification, quantification and generation of biological knowledge). As a strategy for automatic data validation, different protein search engines are integrated. Result analysis is performed using a decoy database search strategy, which allows the measurement of the false-positive identification rate. Peptide identifications across different workflows, different MS techniques, and different search engines are merged to obtain a quality-controlled protein list.
The proteomics identifications database (PRIDE), as a public data repository, is an archiving system where data are finally stored and no longer changed by further processing steps. Data submission to PRIDE is open to proteomics laboratories generating protein and peptide identifications. An export tool has been developed for transferring all relevant HUPO BPP data from ProteinScape into PRIDE using the PRIDE.xml format.
The EU-funded ProDac project will coordinate the development of software tools covering international standards for the representation of proteomics data. The implementation of data submission pipelines and systematic data collection in public standards-compliant repositories will cover all aspects, from the generation of MS data in each laboratory to the conversion of all the annotating information and identifications to a standardized format. Such datasets can then be used in the course of publishing in scientific journals.
Proteomics is rapidly evolving into a high-throughput technology, in which substantial and systematic studies are conducted on samples from a wide range of physiological, developmental, or pathological conditions. Reference maps from 2D gels are widely circulated. However, there is, as yet, no formally accepted standard representation to support the sharing of proteomics data, and little systematic dissemination of comprehensive proteomic data sets.
This paper describes the design, implementation and use of a Proteome Experimental Data Repository (PEDRo), which makes comprehensive proteomics data sets available for browsing, searching and downloading. It also serves to extend the debate on the level of detail at which proteomics data should be captured, the sorts of facilities that should be provided by proteome data management systems, and the techniques by which such facilities can be made available.
The PEDRo database provides access to a collection of comprehensive descriptions of experimental data sets in proteomics. Not only are these data sets interesting in and of themselves, they also provide a useful early validation of the PEDRo data model, which has served as a starting point for the ongoing standardisation activity through the Proteome Standards Initiative of the Human Proteome Organisation.
The range of heterogeneous approaches available for quantifying protein abundance via mass spectrometry (MS) leads to considerable challenges in modeling, archiving, exchanging, or submitting experimental data sets as supplemental material to journals. To date, there has been no widely accepted format for capturing the evidence trail of how quantitative analysis has been performed by software, for transferring data between software packages, or for submitting to public databases. In the context of the Proteomics Standards Initiative, we have developed the mzQuantML data standard. The standard can represent quantitative data about regions in two-dimensional retention time versus mass/charge space (called features), peptides, and proteins and protein groups (where there is ambiguity regarding peptide-to-protein inference), and it offers limited support for small molecule (metabolomic) data. The format has structures for representing replicate MS runs, grouping of replicates (for example, as study variables), and capturing the parameters used by software packages to arrive at these values. The format has the capability to reference other standards such as mzML and mzIdentML, and thus the evidence trail for the MS workflow as a whole can now be described. Several software implementations are available, and we encourage other bioinformatics groups to use mzQuantML as an input, internal, or output format for quantitative software and for structuring local repositories. All project resources are available in the public domain from the HUPO Proteomics Standards Initiative http://www.psidev.info/mzquantml.
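The central "feature" concept above can be pictured with a small data structure. This is a hedged sketch of the idea only; the field names are illustrative and are not the literal mzQuantML element or attribute names:

```python
from dataclasses import dataclass

@dataclass
class Feature:
    """A quantified region in retention-time vs. m/z space.

    Illustrative names only; mzQuantML defines its own XML elements.
    The spectrum reference shows how a feature can point back into an
    mzML file, preserving the evidence trail across standards.
    """
    feature_id: str
    mz: float           # mass-to-charge ratio of the feature
    rt: float           # retention time, e.g. in seconds
    charge: int
    spectrum_ref: str   # reference into an external mzML run
```

Grouping such features under replicate runs and study variables, as the format allows, then amounts to nesting these records inside run- and assay-level containers.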
Mass spectrometry-based protein identification methods are fundamental to proteomics. Biological experiments are usually performed in replicates, and proteomic analyses generate huge datasets that need to be integrated and quantitatively analyzed. The Sequest™ search algorithm is commonly used for identifying peptides and proteins from two-dimensional liquid chromatography electrospray ionization tandem mass spectrometry (2-D LC ESI MS2) data. A number of proteomic pipelines that facilitate high-throughput post-data-acquisition analysis are described in the literature. However, these pipelines need to be updated to accommodate rapidly evolving data analysis methods. Here, we describe a proteomic data analysis pipeline that specifically addresses two main issues pertinent to protein identification and differential expression analysis: 1) estimation of the probability of peptide and protein identifications and 2) non-parametric statistics for protein differential expression analysis. Our proteomic analysis workflow analyzes replicate datasets from a single experimental paradigm to generate a list of identified proteins with their probabilities and significant changes in protein expression, using parametric and non-parametric statistics.
The input for our workflow is Bioworks™ 3.2 Sequest (or a later version, including cluster) output in XML format. We use a decoy database approach to assign probability to peptide identifications. The user has the option to select "quality thresholds" on peptide identifications based on the P value. We also estimate probability for protein identification. Proteins identified with peptides at a user-specified threshold value from biological experiments are grouped as either control or treatment for further analysis in ProtQuant. ProtQuant utilizes a parametric (ANOVA) method for calculating differences in protein expression based on the quantitative measure ΣXcorr. Alternatively, ProtQuant output can be further processed using non-parametric Monte-Carlo resampling statistics to calculate P values for differential expression. Correction for multiple testing of ANOVA and resampling P values is done using Benjamini and Hochberg's method. The results of these statistical analyses are then combined into a single output file containing a comprehensive protein list with probabilities and differential expression analysis, associated P values, and resampling statistics.
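The Benjamini-Hochberg correction mentioned above is short enough to sketch in full. This is a generic implementation of the step-up FDR procedure, not the pipeline's actual code:

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (step-up FDR procedure).

    Each raw p-value p_(k) (k-th smallest of m) is scaled by m/k, then
    a running minimum from the largest rank downward enforces that
    adjusted values are monotone non-decreasing in the raw p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    for rank_from_end, i in enumerate(reversed(order)):
        rank = m - rank_from_end          # 1-based rank of p_values[i]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

An identification is then called significant at FDR level q simply when its adjusted p-value is at most q.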
For biologists carrying out proteomics by mass spectrometry, our workflow facilitates automated, easy to use analyses of Bioworks (3.2 or later versions) data. All the methods used in the workflow are peer-reviewed and as such the results of our workflow are compliant with proteomic data submission guidelines to public proteomic data repositories including PRIDE. Our workflow is a necessary intermediate step that is required to link proteomics data to biological knowledge for generating testable hypotheses.
We have developed an integrated web-accessible software system called the Yale Protein Expression Database (YPED) to address the need for storage, retrieval, and integrated analysis of large amounts of data from high-throughput proteomic technologies. YPED is an open-source system which integrates gel analysis results with protein identifications from DIGE experiments. The system associates the DIGE gel spots and image, analyzed with DeCyder, with mass spectrometric protein identifications from selected gel spots. Following in-gel trypsin digestion, proteins in spots of interest are analyzed using MALDI-TOF/TOF on an AB 4700 or, more recently, on an AB 4800, with protein identifications performed by Mascot in conjunction with the AB GPS Explorer system. In addition to DIGE, YPED currently handles protein identifications from MudPIT, iTRAQ, and ICAT experiments. Sample descriptions are compatible with the evolving MIAPE standards. Tandem MS/MS results from MudPIT and ICAT analyses are validated with the Trans-Proteomic Pipeline and then stored in the database for viewing and linking to the identified proteins. Researchers can view, subset, and download their data through a secure web interface that includes a table of identified proteins, a sample summary, the sample description, and a clickable gel image for DIGE samples. Tools are available to facilitate sample comparison and the viewing of phosphoproteins. A summary report with PANTHER Classification System annotations is also available to aid in biological interpretation of the results. The source code is open source and available from http://yped.med.yale.edu/yped_dist.
protein expression database; mass spectrometry; MudPIT; DIGE; ICAT; iTRAQ; proteomics
Many proteomics initiatives require a seamless bioinformatics integration of a range of analytical steps between sample collection and systems modeling, immediately accessible to the participants involved in the process. Proteomics profiling, from 2D gel electrophoresis to the putative identification of differentially expressed proteins by comparison of mass spectrometry results with reference databases, includes many components of sample processing, not just analysis and interpretation, that are regularly revisited and updated. To support such updates and the dissemination of data, a suitable data structure is needed. However, no data structure is currently available for storing the data for the multiple gels generated by a single proteomic experiment in a single XML file. This paper proposes a data structure based on XML standards to fill the void that exists between the data generated by proteomics experiments and their storage.
In order to address the resulting procedural fluidity, we have adopted and implemented a data model centered on the concept of the annotated gel (AG) as the format for delivery and management of 2D gel electrophoresis results. An eXtensible Markup Language (XML) schema is proposed to manage, analyze and disseminate annotated 2D gel electrophoresis results. The structure of AG objects is formally represented using XML, resulting in the definition of the AGML syntax presented here.
The proposed schema accommodates data on the electrophoresis results as well as the mass-spectrometry analysis of selected gel spots. A web-based software library is being developed to handle data storage, analysis and graphic representation. Computational tools described will be made available at . Our development of AGML provides a simple data structure for storing 2D gel electrophoresis data.
Flow cytometry technology is widely used in both health care and research. The rapid expansion of flow cytometry applications has outpaced the development of data storage and analysis tools. Collaborative efforts to close this gap include building common vocabularies and ontologies, designing generic data models, and defining data exchange formats. The Minimum Information about a Flow Cytometry Experiment (MIFlowCyt) standard was recently adopted by the International Society for Advancement of Cytometry. This standard guides researchers on the information that should be included in peer-reviewed publications, but it is insufficient for data exchange and integration between computational systems. The Functional Genomics Experiment (FuGE) formalizes common aspects of comprehensive and high-throughput experiments across different biological technologies. We have extended the FuGE object model to accommodate flow cytometry data and metadata.
We used the MagicDraw modelling tool to design a UML model (Flow-OM) according to the FuGE extension guidelines and the AndroMDA toolkit to transform the model to a markup language (Flow-ML). We mapped each MIFlowCyt term to either an existing FuGE class or to a new FuGEFlow class. The development environment was validated by comparing the official FuGE XSD to the schema we generated from the FuGE object model using our configuration. After the Flow-OM model was completed, the final version of the Flow-ML was generated and validated against an example MIFlowCyt compliant experiment description.
The extension of FuGE for flow cytometry has resulted in a generic FuGE-compliant data model (FuGEFlow), which accommodates and links together all information required by MIFlowCyt. The FuGEFlow model can be used to build software and databases using FuGE software toolkits to facilitate automated exchange and manipulation of potentially large flow cytometry experimental data sets. Additional project documentation, including reusable design patterns and a guide for setting up a development environment, was contributed back to the FuGE project.
We have shown that an extension of FuGE can be used to transform minimum information requirements in natural language to markup language in XML. Extending FuGE required significant effort, but in our experiences the benefits outweighed the costs. The FuGEFlow is expected to play a central role in describing flow cytometry experiments and ultimately facilitating data exchange including public flow cytometry repositories currently under development.
One of the primary challenges in translational research data management is breaking down the barriers between the multiple data silos and integrating 'omics data with clinical information to complete the cycle from the bench to the bedside. The role of contextual metadata, also called provenance information, is a key factor in effective data integration, reproducibility of results, correct attribution of original source, and answering research queries involving "What", "Where", "When", "Which", "Who", "How", and "Why" (also known as the W7 model). At present, however, there is limited or no effective approach to managing and leveraging provenance information for integrating data across studies or projects. Hence, there is an urgent need for a paradigm shift in creating a "provenance-aware" informatics platform to address this challenge. We introduce an ontology-driven, intuitive Semantic Proteomics Dashboard (SemPoD) that uses provenance together with domain information (semantic provenance) to enable researchers to query, compare, and correlate different types of data across multiple projects, and to allow integration with legacy data to support their ongoing research.
SemPoD is an intuitive and powerful provenance-ontology-driven data access and query platform that uses the MIAPE and MIMIx metadata guidelines to create an integrated view over large-scale systems biology datasets. SemPoD leverages the SysPro ontology to create an intuitive dashboard for biologists to compose queries, explore the results, and use a query manager to store queries for later use. SemPoD can be deployed over many existing database applications storing 'omics data, including, as illustrated here, the LabKey data-management system. Initial user feedback evaluating the usability and functionality of SemPoD has been very positive, and wider deployment beyond the proteomics domain, in other 'omics centers, is under consideration.
The HUPO Proteomics Standards Initiative has developed several standardized data formats to facilitate data sharing in mass spectrometry (MS)-based proteomics. These allow researchers to report their complete results in a unified way. However, at present, there is no format to describe the final qualitative and quantitative results for proteomics and metabolomics experiments in a simple tabular format. Many downstream analysis use cases are only concerned with the final results of an experiment and require an easily accessible format, compatible with tools such as Microsoft Excel or R.
We developed the mzTab file format for MS-based proteomics and metabolomics results to meet this need. mzTab is intended as a lightweight supplement to the existing standard XML-based file formats (mzML, mzIdentML, mzQuantML), providing a comprehensive summary, similar in concept to the supplemental material of a scientific publication. mzTab files can contain protein, peptide, and small molecule identifications together with experimental metadata and basic quantitative information. The format is not intended to store the complete experimental evidence but provides mechanisms to report results at different levels of detail, ranging from a simple summary of the final results to a representation of the results including the experimental design. This format is ideally suited to making MS-based proteomics and metabolomics results available to the wider biological community outside the field of MS. Several software tools for proteomics and metabolomics have already adopted the format as an output format. The comprehensive mzTab specification document and extensive additional documentation can be found online.
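Because mzTab is a plain tab-separated format, it can be read with ordinary spreadsheet or scripting tools. The sketch below shows what a minimal mzTab-style file looks like (metadata lines prefixed `MTD`, a protein header row `PRH`, protein rows `PRT`) and how it might be parsed; the accessions and score values are invented for illustration, and real files carry many more columns.

```python
import csv
import io

# A minimal, illustrative mzTab-style snippet (accessions and values invented).
MZTAB = """MTD\tmzTab-version\t1.0.0
MTD\tmzTab-mode\tSummary
MTD\tmzTab-type\tIdentification
PRH\taccession\tdescription\tbest_search_engine_score[1]
PRT\tP99998\texample protein A\t0.001
PRT\tP99999\texample protein B\t0.020
"""

def read_mztab_proteins(text):
    """Return PRT rows as dicts keyed by the PRH header columns."""
    header, proteins = None, []
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row:
            continue
        if row[0] == "PRH":
            header = row[1:]          # column names for the protein section
        elif row[0] == "PRT" and header:
            proteins.append(dict(zip(header, row[1:])))
    return proteins

proteins = read_mztab_proteins(MZTAB)
```

The same row-prefix convention applies to the peptide (`PEH`/`PEP`) and small molecule (`SMH`/`SML`) sections, so a reader for one section generalizes readily to the others.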
In the post-genome era, most research scientists working in the field of proteomics are confronted with difficulties in management of large volumes of data, which they are required to keep in formats suitable for subsequent data mining. Therefore, a well-developed open source laboratory information management system (LIMS) should be available for their proteomics research studies.
We developed an open source LIMS appropriately customized for 2-D gel electrophoresis-based proteomics workflow. The main features of its design are compactness, flexibility and connectivity to public databases. It supports the handling of data imported from mass spectrometry software and 2-D gel image analysis software. The LIMS is equipped with the same input interface for 2-D gel information as a clickable map on public 2DPAGE databases. The LIMS allows researchers to follow their own experimental procedures by reviewing the illustrations of 2-D gel maps and well layouts on the digestion plates and MS sample plates.
Our new open source LIMS is now available as a basic model for proteome informatics, and is accessible for further improvement. We hope that many research scientists working in the field of proteomics will evaluate our LIMS and suggest ways in which it can be improved.
There is increasing interest in the development of computational methods to analyze fluorescent microscopy images and enable automated large-scale analysis of the subcellular localization of proteins. Determining the subcellular localization is an integral part of identifying a protein's function, and the application of bioinformatics to this problem provides a valuable tool for the annotation of proteomes. Training and validating algorithms used in image analysis research typically rely on large sets of image data, and would benefit from a large, well-annotated and highly-available database of images and associated metadata.
The Yeast Resource Center Public Image Repository (YRC PIR) is a large database of images depicting the subcellular localization and colocalization of proteins. Designed especially for computational biologists who need large numbers of images, the YRC PIR contains 532,182 TIFF images from nearly 85,000 separate experiments and their associated experimental data. All images and associated data are searchable, and the results browsable, through an intuitive web interface. Search results, experiments, individual images or the entire dataset may be downloaded as standards-compliant OME-TIFF data.
The YRC PIR is a powerful resource for researchers to find, view, and download many images and associated metadata depicting the subcellular localization and colocalization of proteins, or classes of proteins, in a standards-compliant format. The YRC PIR is freely available at http://images.yeastrc.org/.
OmicsHub Proteomics integrates all the steps of a mass spectrometry experiment in one single platform, reducing time and data-management complexity. Its proteomics data automation and data management/analysis features address the typical problems laboratory members face on a daily basis, simplifying tasks such as multiple search engine support, pathway integration, and custom report generation for external customers. OmicsHub has been designed as a central data management system to collect, analyze and annotate proteomics experimental data, enabling users to automate tasks. OmicsHub Proteomics helps laboratories easily meet proteomics standards such as PRIDE or FuGE and supports controlled-vocabulary experiment annotation. The software enables laboratory members to take greater advantage of the unique capabilities of the Mascot and Phenyx search engines for protein identification: multiple searches can be launched at once, allowing peak list data from several spots or chromatograms to be sent concurrently to Mascot/Phenyx. OmicsHub Proteomics works for both LC and gel workflows. The system stores and compares proteomics data generated by different mass spectrometry instruments in a single platform, instead of requiring specific software for each of them. It is a web application that installs on a single server and needs only a web browser for access. All experimental actions are user-stamped and date-stamped, allowing audit tracking of every action performed in OmicsHub. Its main features include protein identification, biological annotation, report customization, PRIDE standard support, pathway integration, grouping of protein results to remove redundancy, peak filtering, and FDR cutoffs for decoy databases. OmicsHub Proteomics is flexible enough for parsers for new file formats to be easily added, and its perpetual license is competitively priced.
The Proteomics Standard Initiative Common QUery InterfaCe (PSICQUIC) specification was
created by the Human Proteome Organization Proteomics Standards Initiative (HUPO-PSI) to
enable computational access to molecular-interaction data resources by means of a standard
Web Service and query language. Currently providing >150 million binary interaction
evidences from 28 servers globally, the PSICQUIC interface allows the concurrent search of
multiple molecular-interaction information resources using a single query. Here, we
present an extension of the PSICQUIC specification (version 1.3), which has been released
to be compliant with the enhanced standards in molecular interactions. The new release
also includes a new reference implementation of the PSICQUIC server available to the data
providers. It offers augmented web service capabilities and improves the user experience.
PSICQUIC has been running for almost 5 years, with a user base growing from only 4 data
providers to 28 (April 2013), allowing access to 151,310,109 binary interactions. The power
of this web service is demonstrated by the PSICQUIC View web application, an example of how to
simultaneously query, browse and download results from the different PSICQUIC servers.
This application is free and open to all users with no login requirement (http://www.ebi.ac.uk/Tools/webservices/psicquic/view/main.xhtml).
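A PSICQUIC query is composed from a server's REST base URL, a MIQL query string, and paging/format parameters. The sketch below builds such a URL for the IntAct server; the base URL follows the published PSICQUIC REST pattern, but server addresses change, so consult the PSICQUIC registry for current endpoints before relying on it.

```python
from urllib.parse import quote

# IntAct's PSICQUIC REST base URL (check the PSICQUIC registry for
# current server addresses; this one may have moved).
INTACT_BASE = ("http://www.ebi.ac.uk/Tools/webservices/psicquic/"
               "intact/webservices/current/search")

def psicquic_query_url(base, miql, fmt="tab25", first=0, max_results=100):
    """Compose a PSICQUIC 'query' REST call returning PSI-MI TAB 2.5 results.

    `miql` is a MIQL query string, e.g. 'identifier:BRCA2 AND species:human';
    `format`, `firstResult` and `maxResults` are standard PSICQUIC parameters.
    """
    return (f"{base}/query/{quote(miql)}"
            f"?format={fmt}&firstResult={first}&maxResults={max_results}")

url = psicquic_query_url(INTACT_BASE, "identifier:BRCA2 AND species:human")
```

Because every PSICQUIC server exposes the same interface, the same MIQL string can be sent to all registered servers in turn and the tab-delimited results concatenated, which is essentially what PSICQUIC View does.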
In spite of two-dimensional gel electrophoresis (2-DE) being an effective and widely used method to screen the proteome, its data standardization has still not matured to the level of microarray genomics data or mass spectrometry approaches. The trend toward identifying encompassing data standards has been expanding from genomics to transcriptomics, and more recently to proteomics. The relative success of genomic and transcriptomic data standardization has enabled the development of central repositories such as GenBank and Gene Expression Omnibus. An equivalent 2-DE-centric data structure would similarly have to include a balance among raw data, basic feature detection results, sufficiency in the description of the experimental context and methods, and an overall structure that facilitates a diversity of usages, from central deposition to local data representation in LIMS.
Achieving such a balance can only be accomplished through several iterations involving bioinformaticians, bench molecular biologists, and the manufacturers of the equipment and commercial software from which the data is primarily generated. Such an encompassing data structure is described here, developed as the mature successor to the well-established and broadly used earlier version. A public repository, AGML Central, is configured with a suite of tools for conversion from a variety of popular formats, web-based visualization, and interoperation with other tools and repositories; it is particularly oriented toward mass spectrometry, with I/O support for annotation and data analysis.
We present 2DDB, a bioinformatics solution for storage, integration and analysis of quantitative proteomics data. As data complexity and the rate at which data are produced increase in the proteomics field, so does the need for flexible analysis software.
2DDB is based on a core data model describing fundamentals such as experiment description and identified proteins. Extended data models are built on top of the core data model to capture more specific aspects of the data. A number of public databases and bioinformatics tools have been integrated, giving the user access to large amounts of relevant data. The statistical and graphical package R is used for statistical and graphical analysis. The current implementation handles quantitative data from 2D gel electrophoresis and multidimensional liquid chromatography/mass spectrometry experiments.
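The core/extended layering described above can be sketched as a small class hierarchy: core records (experiment, identified protein) that technique-specific extended records inherit from. The names and fields below are invented for illustration and are not 2DDB's actual schema.

```python
from dataclasses import dataclass

# Hypothetical sketch of a layered data model; not 2DDB's real schema.
@dataclass
class Experiment:                 # core model: experiment description
    name: str
    description: str

@dataclass
class IdentifiedProtein:          # core model: a protein identification
    accession: str
    experiment: Experiment

@dataclass
class GelSpotQuant(IdentifiedProtein):   # extended model for 2-DE data
    spot_id: int = 0
    normalized_volume: float = 0.0

exp = Experiment("cervix biopsies", "2-DE/MS differential analysis")
spot = GelSpotQuant("P01013", exp, spot_id=42, normalized_volume=1.8)
```

The benefit of this split is that generic queries (all proteins in an experiment) run against the core model, while technique-specific attributes such as spot volumes live only in the extension, so LC-MS and 2-DE data can coexist in one database.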
The software has been successfully employed in a number of projects, ranging from quantitative liquid chromatography-mass spectrometry analysis of transforming growth factor-beta-stimulated fibroblasts to 2D gel electrophoresis/mass spectrometry analysis of biopsies from human cervix. The software is available for download at SourceForge.
Two-dimensional electrophoresis is an established method used to study differences in protein expression caused by, for example, a disease state or drug treatment. Conventional methods require the separation of one sample on each individual gel. This approach exposes the data to a high level of system variation, such as gel-to-gel variation caused by experimental factors. Quantification of protein differences can thus be uncertain and lead to false biological conclusions. Two alternatives to reduce this variation are to (1) increase the number of replicates or (2) use the Ettan difference gel electrophoresis (DIGE) system. The Ettan DIGE system is a well-established technology in proteomics, which uses an internal standard for between-gel normalization. By pre-labeling samples prior to 2D electrophoresis with three spectrally resolvable CyDye DIGE Fluor dyes, electrophoretic co-migration of three protein samples on the same 2D gel is possible. This approach significantly reduces the number of replicate gels needed to ensure reproducibility and reliability of the differential expression analysis.
Here, we present results from a differential expression analysis experiment performed with Escherichia coli samples grown under different conditions. We demonstrate that by using DIGE and the DeCyder 2D co-detection algorithm, the numbers of replicates are significantly reduced and the system variability is minimized compared to conventional electrophoresis with post-stained gels.
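The internal-standard idea behind DIGE can be illustrated with a small worked example: each spot volume is divided by the volume of the same spot in the pooled-standard channel on the same gel, which cancels gel-to-gel variation and makes the resulting ratios comparable across gels. The spot volumes below are invented, and this is a sketch of the normalization concept only, not the DeCyder co-detection algorithm.

```python
# Sketch of DIGE internal-standard normalization (invented spot volumes).
def standardized_abundance(sample_volume, standard_volume):
    """Spot volume relative to the internal standard on the same gel."""
    return sample_volume / standard_volume

# Gel 1: control sample (e.g. Cy3) vs. pooled internal standard (Cy2).
control = standardized_abundance(12000.0, 10000.0)   # -> 1.2
# Gel 2: treated sample (e.g. Cy5) vs. the same pooled standard (Cy2).
treated = standardized_abundance(27000.0, 15000.0)   # -> 1.8

# Because both values are expressed relative to the shared standard,
# their ratio is a between-gel fold change.
fold_change = treated / control                      # -> 1.5
```

Without the shared standard, the raw volumes 12000 and 27000 could not be compared directly, since part of the difference might reflect gel-to-gel variation rather than biology.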