The Human Proteome Organization (HUPO) Proteomics Standards Initiative has been tasked with developing file formats for storing raw data (mzML) and the results of spectral processing, i.e., protein identification and quantification, from proteomics experiments (mzIdentML). In order to fully characterize complex experiments, special data types have been designed. Standardized file formats will promote visualization, validation and dissemination of data independent of the vendor-specific binary data storage files. Innovative programmatic solutions for robust and efficient data access to standardized file formats will contribute to more rapid wide-scale acceptance of these file formats by the proteomics community.
In this work, we compare algorithms for accessing spectral data in the mzML file format. Because mzML is an XML format, its data structures can be parsed efficiently with XML-specific classes. These classes, however, provide only sequential access to files, whereas many algorithms for processing proteomics datasets need random access to spectral data. Here, we demonstrate the use of memory streams to convert sequential access into random access while preserving the elegant XML parsing capabilities. Benchmarking of file access times in sequential and random access modes shows that random access is more time efficient for a small number of spectra, whereas sequential access becomes more efficient when a large number of spectra is retrieved. We also provide comparisons to other file access methods from academia and industry.
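The contrast between the two access modes can be illustrated with a minimal Python sketch. It is not the implementation benchmarked in this work: it assumes a toy, namespace-free stand-in for an mzML file held in an in-memory stream, and it swaps in a simple byte-offset index (a technique comparable to the optional index wrapper that mzML files may carry) to obtain random access. The names `build_offset_index`, `read_spectrum`, and `iter_spectra` are hypothetical.

```python
import io
import re
import xml.etree.ElementTree as ET

# Toy stand-in for an mzML file (assumption): real files use namespaces,
# binary data arrays, and far more metadata.
SAMPLE = b"""<?xml version="1.0"?>
<mzML>
  <run>
    <spectrumList count="3">
      <spectrum index="0" id="scan=1"><cvParam name="ms level" value="1"/></spectrum>
      <spectrum index="1" id="scan=2"><cvParam name="ms level" value="2"/></spectrum>
      <spectrum index="2" id="scan=3"><cvParam name="ms level" value="2"/></spectrum>
    </spectrumList>
  </run>
</mzML>
"""

# Matches an opening <spectrum ...> tag (not <spectrumList>) and captures its id.
SPECTRUM_OPEN = re.compile(rb'<spectrum\s[^>]*\bid="([^"]*)"')
SPECTRUM_CLOSE = b"</spectrum>"


def build_offset_index(stream):
    """One sequential pass: map each spectrum id to its (start, end) byte offsets."""
    stream.seek(0)
    data = stream.read()  # toy approach; a large file would be scanned in chunks
    index = {}
    for match in SPECTRUM_OPEN.finditer(data):
        start = match.start()
        end = data.index(SPECTRUM_CLOSE, start) + len(SPECTRUM_CLOSE)
        index[match.group(1).decode()] = (start, end)
    return index


def read_spectrum(stream, index, spectrum_id):
    """Random access: seek to one spectrum and parse only that element."""
    start, end = index[spectrum_id]
    stream.seek(start)
    return ET.fromstring(stream.read(end - start))


def iter_spectra(stream):
    """Sequential access: stream through every spectrum in document order."""
    stream.seek(0)
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "spectrum":
            yield elem
            elem.clear()  # release the element once the caller has moved on
```

With this sketch, `read_spectrum(io.BytesIO(SAMPLE), index, "scan=2")` touches only the bytes of one spectrum once the index exists, whereas `iter_spectra` always walks the document from the start. That difference is one plausible reason why random access tends to pay off for a few spectra and sequential access for many.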
mzML; XML; Sequential file access; Random file access; Proteomics datasets
Proteomics continues to play a critical role in post-genomic science as continued advances in mass spectrometry and analytical chemistry support the separation and identification of increasing numbers of peptides and proteins from their characteristic mass spectra. In order to facilitate the sharing of this data, various standard formats have been, and continue to be, developed. These formats are not yet fully mature, however, and cannot yet cope with the increasing number of quantitative proteomic technologies being developed.
We propose an extension to the PRIDE and mzData XML schemas to accommodate the concept of multiple samples per experiment and, in addition, to capture the intensities of the iTRAQ™ reporter ions in the entry. A simple Java client has been developed to capture and convert the raw data from common spectral file formats into a valid PRIDE XML entry; it also uses a third-party open source tool to generate iTRAQ™ reporter ion intensities from Mascot output.
We describe an extension to the PRIDE and mzData schemas to enable the capture of quantitative data. Currently this is limited to iTRAQ™ data but is readily extensible to other quantitative proteomic technologies. Furthermore, a software tool has been developed which enables conversion from various mass spectrum file formats and corresponding Mascot peptide identifications to PRIDE formatted XML. The tool represents a simple approach to preparing quantitative and qualitative data for submission to repositories such as PRIDE, which is necessary to facilitate data deposition and sharing in public-domain databases. The software is freely available from .
We have developed the Yale Protein Expression Database (YPED) to address the storage, retrieval, and integrated analysis of proteomics data generated by Yale's Keck Protein Chemistry and Mass Spectrometry Facility. YPED is Web-accessible and currently handles sample requisition, result reporting and sample comparison for ICAT, DIGE and MUDPIT samples. Sample descriptions are compatible with the evolving MIAPE standards. Peptides and proteins identified using Sequest or Mascot are validated with the Trans-Proteomic Pipeline developed at the Institute for Systems Biology, and data from the resulting XML file are stored in the database. Researchers can view, subset and download their data through a secure Web interface.
Many proteomics initiatives require integration of all information with uniform criteria, from collection of samples and data display to publication of experimental results. Integrating and exchanging data of different formats and structures poses a great challenge. XML technology shows promise for this task because of its simplicity and flexibility. Nasopharyngeal carcinoma (NPC) is one of the most common cancers in southern China and Southeast Asia, and its incidence shows marked geographic and racial differences. Although some cancer proteome databases exist, there is still no NPC proteome database.
The raw NPC proteome experiment data were captured into one XML document with the Human Proteome Markup Language (HUP-ML) editor and imported into the native XML database Xindice. The 2D/MS repository of the NPC proteome was constructed with Apache, PHP and Xindice to provide access to the database via the Internet. Our website provides two methods, keyword query and click query, for accessing the entries of the NPC proteome database.
Our 2D/MS repository can be used to share the raw NPC proteomics data that are generated from gel-based proteomics experiments. The database, as well as the PHP source code for constructing users' own proteome repositories, can be accessed at .
Proteomics inherently deals with huge amounts of data. Current mass spectrometers acquire hundreds of thousands of spectra within a single project. Thus, data management and data analysis are a challenge. We have developed a software platform (Proteinscape) that stores all relevant proteomics data efficiently and allows fast access and correlation analysis within proteomics projects.
The software is based on a relational database system using Web-based server-client architecture with intra- and Internet access.
Proteinscape stores relevant data from all steps of proteomics projects—study design, sample treatment, separation techniques (e.g., gel electrophoresis or liquid chromatography), protein digestion, mass spectrometry, and protein database search results. Gel spot data can be imported directly from several 2DE-gel image analysis software packages as well as spot-picking robots. Spectra (MS and MS/MS) are imported automatically during acquisition from MALDI and ESI mass spectrometers.
Many algorithms for automated spectra and search result processing are integrated. PMF spectra are calibrated and filtered for contaminant and polymer peaks (Score-booster). A single non-redundant protein list—containing only proteins that can be distinguished by the MS/MS data—can be generated from MS/MS search results (ProteinExtractor). This algorithm can combine data from different search algorithms or different experiments (MALDI/ESI, or acquisition repetitions) into a single protein list.
Navigation within the database is possible either by using the hierarchy of project, sample, protein/peptide separation, spectrum, and identification results, or by using a gel viewer plug-in. Available features include zooming, annotations (protein, spot name, etc.), export of the annotated image, and links to spot, spectrum, and protein data.
Proteinscape includes sophisticated query tools that allow data retrieval for typical questions in proteome projects. Here we present the benefits and power of the software, drawing on 6 years of continuous use in over 70 proteome projects managed in house.
The original PRIDE Converter tool greatly simplified the process of submitting mass spectrometry (MS)-based proteomics data to the PRIDE database. However, after much user feedback, it was noted that the tool had some limitations and could not handle several user requirements that were now becoming commonplace. This prompted us to design and implement a whole new suite of tools that would build on the successes of the original PRIDE Converter and allow users to generate submission-ready, well-annotated PRIDE XML files. The PRIDE Converter 2 tool suite allows users to convert search result files into PRIDE XML (the format needed for performing submissions to the PRIDE database), generate mzTab skeleton files that can be used as a basis to submit quantitative and gel-based MS data, and post-process PRIDE XML files by filtering out contaminants and empty spectra, or by merging several PRIDE XML files together. All the tools have both a graphical user interface that provides a dialog-based, user-friendly way to convert and prepare files for submission, as well as a command-line interface that can be used to integrate the tools into existing or novel pipelines, for batch processing and power users. The PRIDE Converter 2 tool suite will thus become a cornerstone in the submission process to PRIDE and, by extension, to the ProteomeXchange consortium of MS-proteomics data repositories.
The global analysis of proteins is now feasible due to improvements in techniques such as two-dimensional gel electrophoresis (2-DE), mass spectrometry, yeast two-hybrid systems and the development of bioinformatics applications. The experiments form the basis of proteomics, and present significant challenges in data analysis, storage and querying. We argue that a standard format for proteome data is required to enable the storage, exchange and subsequent re-analysis of large datasets. We describe the criteria that must be met for the development of a standard for proteomics. We have developed a model to represent data from 2-DE experiments, including difference gel electrophoresis along with image analysis and statistical analysis across multiple gels. This part of proteomics analysis is not represented in current proposals for proteomics standards. We are working with the Proteomics Standards Initiative to develop a model encompassing biological sample origin, experimental protocols, a number of separation techniques and mass spectrometry. The standard format will facilitate the development of central repositories of data, enabling results to be verified or re-analysed, and the correlation of results produced by different research groups using a variety of laboratory techniques.
The Human Proteome Organisation’s Proteomics Standards Initiative (HUPO-PSI) has developed the GelML data exchange format for representing gel electrophoresis experiments performed in proteomics investigations. The format closely follows the reporting guidelines for gel electrophoresis, which are part of the Minimum Information About a Proteomics Experiment (MIAPE) set of modules. GelML supports the capture of metadata (such as experimental protocols) and data (such as gel images) resulting from gel electrophoresis so that laboratories can be compliant with the MIAPE Gel Electrophoresis guidelines, while allowing such data sets to be exchanged or downloaded from public repositories. The format is sufficiently flexible to capture data from a broad range of experimental processes, and complements other PSI formats for mass spectrometry data and the results of protein and peptide identifications to capture entire gel-based proteome workflows. GelML has resulted from the open standardisation process of PSI consisting of both public consultation and anonymous review of the specifications.
data standard; gel electrophoresis; database; ontology
The Tissue Microarray Data Exchange Specification (TMA DES) is an XML specification for encoding TMA experiment data. While TMA DES data is encoded in XML, the files that describe its syntax, structure, and semantics are not. The DTD format is used to describe the syntax and structure of TMA DES, and the ISO 11179 format is used to define the semantics of TMA DES. However, XML Schema can be used in place of DTDs, and another XML encoded format, RDF, can be used in place of ISO 11179. Encoding all TMA DES data and metadata in XML would simplify the development and usage of programs which validate and parse TMA DES data. XML Schema has advantages over DTDs such as support for data types, and a more powerful means of specifying constraints on data values. An advantage of RDF encoded in XML over ISO 11179 is that XML defines rules for encoding data, whereas ISO 11179 does not.
Materials and Methods:
We created an XML Schema version of the TMA DES DTD. We wrote a program that converted ISO 11179 definitions to RDF encoded in XML, and used it to convert the TMA DES ISO 11179 definitions to RDF.
Using our XML Schema, we validated a sample TMA DES XML file that was supplied with the publication that originally specified TMA DES. We also successfully validated the RDF produced by our ISO 11179 converter with the W3C RDF validation service.
All TMA DES data could be encoded using XML, which simplifies its processing. XML Schema allows datatypes and valid value ranges to be specified for CDEs, which enables a wider range of error checking to be performed using XML Schemas than could be performed using DTDs.
CDEs; DTD; statistical analysis; tissue microarray; TMA DES; XML
We here present the jmzReader library: a collection of Java application programming interfaces (APIs) to parse the most commonly used peak list and XML-based mass spectrometry (MS) data formats: DTA, MS2, MGF, PKL, mzXML, mzData, and mzML (based on the already existing API jmzML). The library is optimized to be used in conjunction with mzIdentML, the recently released standard data format for reporting protein and peptide identifications, developed by the HUPO proteomics standards initiative (PSI). mzIdentML files do not contain spectra data but contain references to different kinds of external MS data files. As a key functionality, all parsers implement a common interface that supports the various methods used by mzIdentML to reference external spectra. Thus, when developing software for mzIdentML, programmers no longer have to support multiple MS data file formats but only this one interface. The library (which includes a viewer) is open source and, together with detailed documentation, can be downloaded from http://code.google.com/p/jmzreader/.
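The idea of a single interface shared by all parsers can be sketched as follows. This is an illustrative Python analogue only, not the jmzReader Java API: the class and method names (`SpectrumSource`, `MgfSource`, `get_by_index`, `get_by_id`) are hypothetical, and only a minimal subset of the MGF peak list format is handled. It assumes that external spectra are referenced either by position in the file or by an identifier such as the spectrum title.

```python
from abc import ABC, abstractmethod

# Small MGF example used for illustration (assumed content).
SAMPLE_MGF = """\
BEGIN IONS
TITLE=spec A
PEPMASS=500.25
CHARGE=2+
100.0 10.0
200.5 20.0
END IONS

BEGIN IONS
TITLE=spec B
PEPMASS=612.8
300.1 5.0
END IONS
"""


class SpectrumSource(ABC):
    """Hypothetical common interface: one way to resolve spectrum references."""

    @abstractmethod
    def get_by_index(self, index):
        """Return the spectrum at a 0-based position in the file."""

    @abstractmethod
    def get_by_id(self, spectrum_id):
        """Return the spectrum with the given identifier."""


class MgfSource(SpectrumSource):
    """Minimal MGF reader; here the TITLE field serves as the identifier."""

    def __init__(self, text):
        self._spectra = []
        current = None
        for raw in text.splitlines():
            line = raw.strip()
            if line == "BEGIN IONS":
                current = {"title": None, "peaks": []}
            elif line == "END IONS":
                if current is not None:
                    self._spectra.append(current)
                current = None
            elif current is None or not line:
                continue  # skip blank lines and anything outside a spectrum
            elif "=" in line:
                key, value = line.split("=", 1)
                if key == "TITLE":
                    current["title"] = value
            else:
                mz, intensity = line.split()[:2]
                current["peaks"].append((float(mz), float(intensity)))
        self._by_title = {
            s["title"]: s for s in self._spectra if s["title"] is not None
        }

    def get_by_index(self, index):
        return self._spectra[index]

    def get_by_id(self, spectrum_id):
        return self._by_title[spectrum_id]
```

Code that consumes identification results would then depend only on `SpectrumSource`, so supporting another peak list format would mean adding one more subclass rather than changing the callers.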
Bioinformatics; Data standard; Java; MS data processing; Proteomics standards initiative
To enhance the readability, improve the structure, and facilitate the sharing of Digital Imaging and Communications in Medicine (DICOM) files, we propose an XML-based DICOM data format. Because XML Schema offers great flexibility for expressing constraints on the content model of elements, we used it to describe the new format, thus making it consistent with the one originally defined by DICOM. Such schemas can also be used in the creation and validation of XML-encoded DICOM files, acting as a standard for data transmission and sharing on the Web. In defining the new data format, we started by representing a single data element and then improved the whole data structure through modularization. In contrast to the original format, the new one has a better structure without loss of related information. In addition, we demonstrated the application of XSLT and XQuery. All of these advantages follow from the new data format.
DICOM; data format; XML; XML Schema
Isotope labeling combined with liquid chromatography–mass spectrometry (LC–MS) provides a robust platform for analyzing differential protein expression in proteomics research. We present a web service, called MaXIC-Q Web (http://ms.iis.sinica.edu.tw/MaXIC-Q_Web/), for quantitation analysis of large-scale datasets generated from proteomics experiments using various stable isotope-labeling techniques, e.g. SILAC, ICAT and user-developed labeling methods. It accepts spectral files in the standard mzXML format and search results from SEQUEST, Mascot and ProteinProphet as input. Furthermore, MaXIC-Q Web uses statistical and computational methods to construct two kinds of elution profiles for each ion, namely, PIMS (projected ion mass spectrum) and XIC (extracted ion chromatogram) from MS data. Toward accurate quantitation, a stringent validation procedure is performed on PIMSs to filter out peptide ions that suffer interference from co-eluting peptides or noise. The areas of XICs determine ion abundances, which are used to calculate peptide and protein ratios. Since MaXIC-Q Web adopts stringent validation on spectral data, it achieves high accuracy so that manual validation effort can be substantially reduced. Furthermore, it provides various visualization diagrams and comprehensive quantitation reports so that users can conveniently inspect quantitation results. In summary, MaXIC-Q Web is a user-friendly, interactive, robust, generic web service for quantitation based on ICAT and SILAC labeling techniques.
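The step from XIC areas to ratios can be made concrete with a small Python sketch. It is not the MaXIC-Q Web implementation: it assumes trapezoidal integration of each XIC, a heavy-to-light ratio convention, and the median of peptide ratios as the protein ratio, none of which is specified above. The function names are hypothetical.

```python
import statistics


def xic_area(times, intensities):
    """Area under an extracted ion chromatogram by the trapezoidal rule."""
    if len(times) != len(intensities):
        raise ValueError("times and intensities must have the same length")
    area = 0.0
    for i in range(1, len(times)):
        width = times[i] - times[i - 1]
        area += 0.5 * (intensities[i] + intensities[i - 1]) * width
    return area


def peptide_ratio(light_xic, heavy_xic):
    """Heavy/light abundance ratio; each XIC is a (times, intensities) pair."""
    light_area = xic_area(*light_xic)
    if light_area == 0:
        raise ValueError("light XIC has zero area")
    return xic_area(*heavy_xic) / light_area


def protein_ratio(peptide_ratios):
    """Summarize peptide ratios for one protein (median is an assumption)."""
    return statistics.median(peptide_ratios)
```

For example, a light XIC with intensities `[0, 10, 10, 0]` and a heavy XIC with intensities `[0, 20, 20, 0]` sampled at the same four unit-spaced time points have areas of 20 and 40, giving a peptide ratio of 2.0.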
Differential proteome studies are a powerful tool for the analysis of differences between two sample states. A challenge encountered in any proteome study is the reproducibility of the sample preparation and data analysis. The significance analysis of the results and the extent to which changes can reliably be detected are affected by this.
We studied the changes of the proteome during cell differentiation using a combination of large format 2D gel electrophoresis, image analysis, and mass spectrometry.
The basis for any analysis is the reproducibility of the results and the study design. Firstly, the reproducibility of large-format 2D gel electrophoresis was shown. Two samples from the same patient were analyzed using three replicate gels each. The spot quantitation of the two samples was found to be in good agreement. The mean relative standard deviation (coefficient of variation) of the spot intensities within the replicate gels was 20%. This allows us to analyze changes in the protein spot intensity that are smaller than a factor of two. The study design was optimized in order to account for technical and biological variation.
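As a reminder of how such a figure is obtained, the coefficient of variation of one spot across replicate gels is the sample standard deviation divided by the mean. The short Python sketch below uses made-up intensities, not data from this study, and the function name is hypothetical.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (needs >= 2 values)."""
    mean = statistics.fmean(values)
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return statistics.stdev(values) / mean


# Hypothetical intensities of one spot on three replicate gels.
replicate_intensities = [80.0, 100.0, 120.0]
cv = coefficient_of_variation(replicate_intensities)
```

For these illustrative values the mean is 100 and the sample standard deviation is 20, so the coefficient of variation is 0.2, i.e. 20%.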
In the main study, 1800–2000 spots were quantified per gel. The large patient heterogeneity did not allow us to use a strict fold-change criterion for the selection of significantly changed spots between the two sample states. The variation of the spot intensity in one patient group was very much dependent on the nature of each individual protein. Therefore, a Student’s t-test was employed to calculate the statistical significance for each spot. A total of 31 protein spots were found to be changed upon differentiation. Of these, 17 spots were unique for one of the samples, and another 14 spots were found to be highly significant (P = 99.9%). The effect of the Bonferroni correction and the false discovery rate is evaluated.
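The two multiple-testing approaches mentioned above can be compared with a short Python sketch. It is a generic illustration, not the analysis code of this study: the false discovery rate is controlled here with the Benjamini-Hochberg step-up procedure, which is an assumption because the procedure used is not named above, and the p-values are made up.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag tests whose p-value is at most alpha divided by the number of tests."""
    if not p_values:
        return []
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


def benjamini_hochberg_significant(p_values, alpha=0.05):
    """Flag tests accepted by the Benjamini-Hochberg step-up procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= alpha * k / m.
    largest_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / m:
            largest_rank = rank
    # Every test ranked at or below k is declared significant.
    significant = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= largest_rank:
            significant[i] = True
    return significant


# Hypothetical per-spot p-values from t-tests.
example_p = [0.001, 0.012, 0.025, 0.045, 0.5]
```

With these illustrative p-values and alpha = 0.05, Bonferroni keeps only the first spot (threshold 0.01), whereas the Benjamini-Hochberg procedure keeps the first three. This shows why the choice of correction matters when thousands of spots are tested per gel.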
The use of human urine as a diagnostic tool has many advantages, such as ease of sample acquisition and noninvasiveness. However, the discovery of novel biomarkers, as well as biomarker patterns, in urine is hindered mainly by a lack of comparable datasets. To fill this gap, we assembled a new urinary fingerprint database. Here, we report the establishment of a human urinary proteomic fingerprint database using urine from 200 individuals analysed by SELDI-TOF (surface enhanced laser desorption ionisation-time of flight) mass spectrometry (MS) on several chip surfaces (SEND, HP50, NP20, Q10, CM10, and IMAC30). The database currently lists 2490 unique peaks/ion species from 1172 nonredundant SELDI analyses in the mass range of 1500 to 150000. All unprocessed mass spectrometric scans are available as “.xml” data files. Additionally, 1384 peaks were included from external studies using CE (capillary electrophoresis)-MS, MALDI (matrix assisted laser desorption/ionisation), and CE-MALDI hybrids. We propose to use this platform as a global resource to share and exchange primary data derived from MS analyses in urinary research.
Summary: Mass spectrometry-based proteomics stands to gain from additional analysis of its data, but its large, complex datasets place demands on speed and memory usage that require special consideration in scripting languages. The software library ‘mspire’, developed in the Ruby programming language, offers quick and memory-efficient readers for standard XML proteomics formats, converters for intermediate file types in typical proteomics spectral-identification workflows (including the Bioworks .srf format), and modules for the calculation of peptide false identification rates.
Availability: Freely available at http://mspire.rubyforge.org. Additional data models, usage information, and methods available at http://bioinformatics.icmb.utexas.edu/mspire
Mass spectrometry is an important technique for analyzing proteins and other biomolecular compounds in biological samples. Each of the vendors of these mass spectrometers uses a different proprietary binary output file format, which has hindered data sharing and the development of open source software for downstream analysis. The solution has been to develop, with the full participation of academic researchers as well as software and hardware vendors, an open XML-based format for encoding mass spectrometer output files, and then to write software to use this format for archiving, sharing, and processing. This chapter presents the various components and information available for this format, mzML. In addition to the XML schema that defines the file structure, a controlled vocabulary provides clear terms and definitions for the spectral metadata, and a semantic validation rules mapping file allows the mzML semantic validator to ensure that an mzML document complies with one of several levels of requirements. Complete documentation and example files ensure that the format may be uniformly implemented. At the time of release there already existed several implementations of the format and vendors have committed to supporting the format in their products.
file format; mzML; standards; XML; controlled vocabulary
The field of proteomics, particularly the application of mass spectrometry analysis to protein samples, is well-established and growing rapidly. Proteomics studies generate large volumes of raw experimental data and inferred biological results. To facilitate the dissemination of these data, centralized data repositories have been developed that make the data and results accessible to proteomics researchers and biologists alike. This review of proteomics data repositories focuses exclusively on freely-available, centralized data resources that disseminate or store experimental mass spectrometry data and results. The resources chosen reflect a current “snapshot” of the state of resources available with an emphasis placed on resources that may be of particular interest to yeast researchers. Resources are described in terms of their intended purpose and the features and functionality provided to users.
Although two-dimensional gel electrophoresis (2-DE) is an effective and widely used method to screen the proteome, its data standardization has still not matured to the level of microarray genomics data or mass spectrometry approaches. The trend toward identifying encompassing data standards has been expanding from genomics to transcriptomics, and more recently to proteomics. The relative success of genomic and transcriptomic data standardization has enabled the development of central repositories such as GenBank and Gene Expression Omnibus. An equivalent 2-DE-centric data structure would similarly have to strike a balance among raw data, basic feature detection results, a sufficient description of the experimental context and methods, and an overall structure that facilitates a diversity of usages, from central repositories to local data representation in LIMS.
Results & Conclusion
Achieving such a balance can only be accomplished through several iterations involving bioinformaticians, bench molecular biologists, and the manufacturers of the equipment and commercial software from which the data is primarily generated. Such an encompassing data structure is described here, developed as the mature successor to the well established and broadly used earlier version. A public repository, AGML Central, is configured with a suite of tools for the conversion from a variety of popular formats, web-based visualization, and interoperation with other tools and repositories, and is particularly mass-spectrometry oriented with I/O for annotation and data analysis.
Technological advances in mass spectrometry and other detection methods are leading to larger and larger proteomics datasets. However, when papers describing such information are published the enormous volume of data can typically only be provided as supplementary data in a tabular form through the journal website. Several journals in the proteomics field, together with the Human Proteome Organization's (HUPO) Proteomics Standards Initiative and institutions such as the Institute for Systems Biology are working towards standardizing the reporting of proteomics data, but just defining standards is only a means towards an end for sharing data. Data repositories such as ProteomeCommons.org and the Open Proteomics Database allow for public access to proteomics data but provide little, if any, interpretation.
Results & conclusion
Here we describe PrestOMIC, an open source application for storing mass spectrometry-based proteomic data in a relational database and for providing a user-friendly, searchable and customizable browser interface to share one's data with the scientific community. The underlying database and all associated applications are built on other existing open source tools, allowing PrestOMIC to be modified as the data standards evolve. We then use PrestOMIC to present a recently published dataset from our group through our website.
Tissue MicroArrays (TMAs) are a high throughput technology for rapid analysis of protein expression across hundreds of patient samples. Often, data relating to TMAs is specific to the clinical trial or experiment it is being used for, and not interoperable. The Tissue Microarray Data Exchange Specification (TMA DES) is a set of eXtensible Markup Language (XML)-based protocols for storing and sharing digitized Tissue Microarray data. XML data are enclosed by named tags which serve as identifiers. These tag names can be Common Data Elements (CDEs), which have a predefined meaning or semantics. By using this specification in a laboratory setting with increasing demands for digital pathology integration, we found that the data structure lacked the ability to cope with digital slide imaging in respect to web-enabled digital pathology systems and advanced scoring techniques.
Materials and Methods:
By employing user centric design, and observing behavior in relation to TMA scoring and associated data, the TMA DES format was extended to accommodate the current limitations. This was done with specific focus on developing a generic tool for handling any given scoring system, and utilizing data for multiple observations and observers.
DTDs were created to validate the extensions of the TMA DES protocol, and a test set of data containing scores for 6,708 TMA core images was generated. The XML was then read into an image processing algorithm to utilize the digital pathology data extensions, and scoring results were easily stored alongside the existing multiple pathologist scores.
By extending the TMA DES format to include digital pathology data and customizable scoring systems for TMAs, the new system facilitates the collaboration between pathologists and organizations, and can be used in automatic or manual data analysis. This allows complying systems to effectively communicate complex and varied scoring data.
CDEs; DTD; tissue microarray; TMA DES; virtual pathology; XML
Motivation: Liquid chromatography tandem mass spectrometry (LC-MS/MS) is the predominant method to comprehensively characterize complex protein mixtures such as samples from prefractionated or complete proteomes. In order to maximize proteome coverage for the studied sample, i.e. identify as many traceable proteins as possible, LC-MS/MS experiments are typically repeated extensively and the results combined. Proteome coverage prediction is the task of estimating the number of peptide discoveries of future LC-MS/MS experiments. Proteome coverage prediction is important to enhance the design of efficient proteomics studies. To date, there does not exist any method to reliably estimate the increase of proteome coverage at an early stage.
Results: We propose an extended infinite Markov model DiriSim to extrapolate the progression of proteome coverage based on a small number of already performed LC-MS/MS experiments. The method explicitly accounts for the uncertainty of peptide identifications. We tested DiriSim on a set of 37 LC-MS/MS experiments of a complete proteome sample and demonstrated that DiriSim correctly predicts the coverage progression already from a small subset of experiments. The predicted progression enabled us to specify maximal coverage for the test sample. We demonstrated that quality requirements on the final proteome map impose an upper bound on the number of useful experiment repetitions and limit the achievable proteome coverage.
Contact: email@example.com; firstname.lastname@example.org
Effective proteome analyses are based on interplay between resolution and detection. It had been claimed that resolution was the main factor limiting the use of two-dimensional gel electrophoresis. Improved protein detection now indicates that this is unlikely to be the case. Using a highly refined protocol, the rat brain proteome was extracted, resolved, and detected. In order to overcome the stain saturation threshold, high abundance protein species were excised from the gel following standard imaging. Gels were then imaged again using longer exposure times, enabling detection of lower abundance, less intensely stained protein species. This resulted in a significant enhancement in the detection of resolved proteins, and a slightly modified digestion protocol enabled effective identification by standard mass spectrometric methods. The data indicate that the resolution required for comprehensive proteome analyses is already available, can assess multiple samples in parallel, and preserve critical information concerning post-translational modifications. Further optimization of staining and detection methods promises additional improvements to this economical, widely accessible and effective top-down approach to proteome analysis.
Efficient analysis of results from mass spectrometry-based proteomics experiments requires access to disparate data types, including native mass spectrometry files, output from algorithms that assign peptide sequence to MS/MS spectra, and annotation for proteins and pathways from various database sources. Moreover, proteomics technologies and experimental methods are not yet standardized; hence a high degree of flexibility is necessary for efficient support of high- and low-throughput data analytic tasks. Development of a desktop environment that is sufficiently robust for deployment in data analytic pipelines, and simultaneously supports customization for programmers and non-programmers alike, has proven to be a significant challenge.
We describe multiplierz, a flexible and open-source desktop environment for comprehensive proteomics data analysis. We use this framework to expose a prototype version of our recently proposed common API (mzAPI) designed for direct access to proprietary mass spectrometry files. In addition to routine data analytic tasks, multiplierz supports generation of information rich, portable spreadsheet-based reports. Moreover, multiplierz is designed around a "zero infrastructure" philosophy, meaning that it can be deployed by end users with little or no system administration support. Finally, access to multiplierz functionality is provided via high-level Python scripts, resulting in a fully extensible data analytic environment for rapid development of custom algorithms and deployment of high-throughput data pipelines.
Collectively, mzAPI and multiplierz facilitate a wide range of data analysis tasks, spanning technology development to biological annotation, for mass spectrometry-based proteomics research.
Proteomic biomarker discovery has been called into question. Diamandis hypothesized that seemingly trivial factors, such as eating a hamburger, may cause sufficient proteomic change as to confound proteomic differences. This has been termed the hamburger effect. Little is known about the variability of complex proteomes in response to the environment. Two methods—two-dimensional gel electrophoresis (2DGE) and capillary liquid chromatography–electrospray ionization time-of-flight mass spectrometry (LCMS)—were used to study the hamburger effect in two cross-sections of the soluble fruit fly proteome. 2DGE measured abundant proteins, whereas LCMS measured small proteins and peptides. Proteomic differences between males and females were first evaluated to assess the discriminatory capability of the methods. Likewise, wild-type and white-eyed flies were analyzed as a further demonstration that genetically based proteomic differences could be observed above the background analytical variation. Then dietary interventions were imposed. Ethanol was added to the diet of some populations without significant proteomic effect. However, after a 24-h fast, proteomic differences were found using LCMS but not 2DGE. Even so, only three of ~1000 molecular species were altered significantly, suggesting that the influence of even an extreme diet change produced only modest proteomic variability, and that much of the fruit fly proteome remains relatively constant in response to diet. These experiments suggest that proteomics can be a viable approach to biomarker discovery.
proteomics; diet; 2D gel electrophoresis; liquid chromatography-mass spectrometry; Drosophila melanogaster
Despite the growing volumes of proteomic data, integration of the underlying results remains problematic owing to differences in formats, data captured, protein accessions and services available from the individual repositories. To address this, we present the ISPIDER Central Proteomic Database search (http://www.ispider.manchester.ac.uk/cgi-bin/ProteomicSearch.pl), an integration service offering novel search capabilities over leading, mature, proteomic repositories including PRoteomics IDEntifications database (PRIDE), PepSeeker, PeptideAtlas and the Global Proteome Machine. It enables users to search for proteins and peptides that have been characterised in mass spectrometry-based proteomics experiments from different groups, stored in different databases, and view the collated results with specialist viewers/clients. In order to overcome limitations imposed by the great variability in protein accessions used by individual laboratories, the European Bioinformatics Institute's Protein Identifier Cross-Reference (PICR) service is used to resolve accessions from different sequence repositories. Custom-built clients allow users to view peptide/protein identifications in different contexts from multiple experiments and repositories, as well as integration with the Dasty2 client supporting any annotations available from Distributed Annotation System servers. Further information on the protein hits may also be added via external web services able to take a protein as input. This web server offers the first truly integrated access to proteomics repositories and provides a unique service to biologists interested in mass spectrometry-based proteomics.