The field of proteomics, particularly the application of mass spectrometry analysis to protein samples, is well-established and growing rapidly. Proteomics studies generate large volumes of raw experimental data and inferred biological results. To facilitate the dissemination of these data, centralized data repositories have been developed that make the data and results accessible to proteomics researchers and biologists alike. This review of proteomics data repositories focuses exclusively on freely-available, centralized data resources that disseminate or store experimental mass spectrometry data and results. The resources chosen reflect a current “snapshot” of the state of resources available with an emphasis placed on resources that may be of particular interest to yeast researchers. Resources are described in terms of their intended purpose and the features and functionality provided to users.
In 1996, the budding yeast Saccharomyces cerevisiae became the first fully-sequenced eukaryotic system and the subsequent focus of seminal mass spectrometry-based proteomics studies.[1, 2] In the following years, yeast has remained an indispensable system for post-genomic analysis and continues to be an essential platform for the development and advancement of new proteomics techniques. These techniques are continually being extended by researchers working in other systems, and have been applied to a multitude of other organisms in the course of many proteomics studies.
Given the prevalence and broad application of proteomics, the audience interested in proteomics data, at least at some level, is very diverse. It includes clinicians developing medical diagnostic and treatment tools, biologists elucidating the mechanisms of regulation of specific proteins, researchers developing new software tools for the analysis and interpretation of mass spectrometry data, and many others. While it is obvious that, at the highest level, proteomics data repositories exist to make data available to all of these users, it is important to note that proteomics data come in different forms and that different users may have very different needs with regard to the type of proteomics data they are searching for and can use. Examples include a researcher developing a new computational algorithm to study protein-protein interactions who is interested only in the proteins identified in many samples and their associated confidence scores; a researcher developing new computational techniques who is interested only in the raw mass spectra produced directly by the mass spectrometer; and a biologist who wants to see the evidence supporting the identification of predicted sites of post-translational modification in a protein of interest but has no ability to make use of the raw data files available in some proteomics repositories. These are all reasonable requirements of proteomics data repositories, but perhaps not of the same data repository. No proteomics data repository will be all things to all users, and attempting to define a specific and universal purpose for proteomics data repositories is probably best avoided, as is attempting to evaluate all repositories with the same set of criteria.
For this review, we describe the purpose and use of each of the data repositories independently. Priority was given to large, well-established mass spectrometry data repositories that have the most relevance to yeast researchers. Though none of the resources included in this review is entirely yeast-specific, each of the included repositories potentially provides a substantial benefit to yeast proteomics researchers and biologists. The considerations of researchers working in yeast, with regard to viewing and sharing proteomics data, largely overlap the needs of researchers working in other organisms. However, S. cerevisiae does have the advantage of being an exceedingly well-annotated model organism. Researchers studying proteomics results for this organism could benefit from seeing the data presented in the context of the large body of existing data and protein annotations for this organism, which simply are not available for many other organisms. Special consideration was given to resources that attempt to leverage existing annotations when searching for and viewing proteomics data.
We excluded popular data resources, many of which are well-known to yeast researchers, that do not specifically include experimental mass spectrometry data such as the Saccharomyces Genome Database (SGD), the S. pombe GeneDB, Database of Interacting Proteins (DIP) and the General Repository for Interaction Datasets (BioGRID). Additionally, we chose to exclude significant resources specifically focused on non-yeast organisms, such as Human Proteinpedia which is a human-specific resource. Finally, Peptidome from NCBI (http://www.ncbi.nlm.nih.gov/projects/peptidome/) was not reviewed as it was just recently released. Peptidome currently contains only a few sample datasets procured from the other resources but it is expected to rapidly become a major proteomics data repository that will house data (peptides, proteins, raw data in open formats, and associated experimental metadata) associated with future proteomics publications. A summary of the reviewed resources is listed in Table 1.
The NCRR Yeast Resource Center (YRC) is a multi-disciplinary Biomedical Technology Resource Center focused on exploiting the budding yeast Saccharomyces cerevisiae as a model organism to develop tools and technologies to characterize proteins and proteomes, which may then be applied to other systems. Principal technologies employed by the YRC include protein mass spectrometry, protein structure prediction and design, fluorescence microscopy (including fluorescence energy transfer), yeast two-hybrid, and development of novel computational biology algorithms. In addition to core technology development projects, the YRC participates in more than a hundred proteomics collaborations each year with researchers around the world. The motivation behind the development of the YRC Public Data Repository (YRC PDR) was to develop a single, unified interface for the dissemination of experimental proteomics data generated by these disparate technologies in a manner accessible to the general biological research community. It should be noted that the YRC PDR contains proteomics data from many different organisms and is not exclusive to yeast. The YRC PDR is freely available at http://www.yeastrc.org/pdr/.
All data contained in the YRC PDR are associated with proteins. To facilitate finding these data, a protein search engine was implemented that allows users to locate proteins of interest using the names, systematic accession IDs and descriptions from many databases including the Saccharomyces Genome Database (SGD), WormBase, HUGO HGNC, NCBI nr (including GenBank CDS translations and RefSeq), Swiss-Prot/UniProtKB, MIPS, International Protein Index (IPI), FlyBase, and others. Sequences, names and descriptions are mirrored from these databases regularly. Searches may also be refined based on Gene Ontology terms, taxonomical information or the availability of specific types of experimental data for matched proteins. As of this writing, the YRC PDR’s database contains nearly 32 million protein references from these databases. An example of a possible query would be to search for all proteins having the Gene Ontology terms “kinetochore” and “spindle pole”, from the organism S. cerevisiae and limit the results to proteins for which mass spectrometry and yeast two-hybrid data are available. The PDR protein overview page for NKP2/YLR315W, one of the 7 matches to this query, is displayed in Figure 1.
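The example query above can be thought of as a conjunction of filters over protein records. The following is a minimal sketch of that idea only; it is not the YRC PDR's implementation or API. The `ProteinRecord` type, the `search` function, and the second toy record are hypothetical and exist purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProteinRecord:
    name: str
    organism: str
    go_terms: frozenset      # Gene Ontology term names annotated to the protein
    data_types: frozenset    # kinds of experimental data available


def search(records, organism, required_go, required_data):
    """Return records from one organism that carry every required GO term and data type."""
    required_go = frozenset(required_go)
    required_data = frozenset(required_data)
    return [
        r for r in records
        if r.organism == organism
        and required_go <= r.go_terms
        and required_data <= r.data_types
    ]


# Toy records; the second entry is entirely made up.
records = [
    ProteinRecord("NKP2/YLR315W", "S. cerevisiae",
                  frozenset({"kinetochore", "spindle pole"}),
                  frozenset({"mass spectrometry", "yeast two-hybrid"})),
    ProteinRecord("EXAMPLE1", "S. cerevisiae",
                  frozenset({"kinetochore"}),
                  frozenset({"mass spectrometry"})),
]

matches = search(records, "S. cerevisiae",
                 {"kinetochore", "spindle pole"},
                 {"mass spectrometry", "yeast two-hybrid"})
print([r.name for r in matches])
```

Each refinement (GO term, taxonomy, data availability) only narrows the result set, which is why such filters compose naturally in a search form.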
Once a protein of interest is found, a user may view an overview page that contains names and descriptions found for that protein from the mirrored databases, Gene Ontology annotations and other protein annotation data. These data are associated with links that provide more information, such as links to external protein databases, publications used as evidence in Gene Ontology annotations, and an in-line graphical interface for browsing the Gene Ontology graph. An overview is presented of the experimental data available for the protein of interest and links are provided to data viewing interfaces developed for each of the types of data supported by the YRC PDR. These include protein mass spectrometry data, including identified peptides, proteins, post-translational modifications and associated statistics; fluorescence microscopy data, including protein subcellular localizations and fluorescence energy transfer experiments; protein domain and three-dimensional structure predictions; yeast two-hybrid protein-protein interactions; and protein-protein interaction prediction data, presented as an interactive, graphical interface that simplifies the navigation of this complex data.
In summary, the YRC PDR provides a powerful interface for finding proteins of interest and the proteomics data associated with these proteins. Although it does not currently support public uploading of raw mass spectrometry data, the YRC PDR is a proteomics data resource that will be of special interest to yeast researchers, particularly yeast biologists interested in viewing mass spectrometry results alongside other types of proteomics data and known biological information. It displays data in a biological context that enhances interpretation and helps to make the data more accessible to a broader research community. The amount of experimental data in the YRC PDR continues to grow as data generated by collaborations with the center are added. The data are also available for download via the download page within the web site.
The PRIDE (PRoteomics IDEntifications) database is a public, user-populated proteomics data repository.[18, 19] Data generated by mass spectrometry proteomics experiments, including raw spectral data, peptides, protein identifications and associated statistics, may be uploaded, downloaded or viewed using a single, centralized web interface that is independent of the hardware or algorithms used to generate the data. This is made possible by requiring strict adherence to proteomics data standards for data uploaded to the PRIDE database, and a suite of software tools that allows researchers to achieve standards compliance for data generated by many different platforms. PRIDE is freely available at http://www.ebi.ac.uk/pride/.
Central to PRIDE’s mission as a public data repository is the ability for users to directly submit mass spectrometry data and analyzed results. PRIDE supports the submission of data generated from many platforms, provided that the data has first been converted to valid PRIDE XML files. Software tools are provided by PRIDE for converting many common proteomics data formats to PRIDE XML, chiefly the PRIDE Converter (http://code.google.com/p/pride-converter/). Examples of supported formats include Mascot dat and mgf files, X! Tandem XML, mzXML, mzData, SEQUEST result and dta files, MS2 and DTASelect. Additional support is provided for smaller and simpler datasets by the Proteome Harvest project that utilizes Microsoft Excel spreadsheets to organize data for upload.
Data uploaded to PRIDE may be designated as public or private. All users of PRIDE may view public data without registering or logging in but access to private data is restricted by association of data and registered users with collaborations. Members of collaborations may view data associated with the respective collaboration, and the creator of the collaboration controls which users are associated with the collaboration. This method of access control makes PRIDE an excellent platform for investigators to share proteomics data prior to publication, be it with fellow collaborators or reviewers.
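The access rule described above reduces to a small check: public data are visible to everyone, and private data are visible only to members of the associated collaboration. The sketch below illustrates that rule under stated assumptions; the class names, the `can_view` function, and the choice to treat the creator as an implicit member are hypothetical and do not describe PRIDE's actual code.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Collaboration:
    creator: str
    members: set = field(default_factory=set)

    def add_member(self, acting_user, new_member):
        # Only the creator of the collaboration controls its membership.
        if acting_user != self.creator:
            raise PermissionError("only the creator may change membership")
        self.members.add(new_member)


@dataclass
class Experiment:
    accession: str
    public: bool
    collaboration: Optional[Collaboration] = None


def can_view(user, experiment):
    """user is a registered user name, or None for an anonymous visitor."""
    if experiment.public:
        return True
    collab = experiment.collaboration
    if user is None or collab is None:
        return False
    # Assumption: the creator can always view data in their own collaboration.
    return user == collab.creator or user in collab.members
```

Under this model, giving a reviewer pre-publication access amounts to the creator adding one account to the collaboration.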
PRIDE provides multiple options for finding data contained in the repository. Most simply, users may browse a list of experiment descriptions. Users may also browse lists of species, tissue types, cell types, Gene Ontology terms or diseases. Additionally, PRIDE provides simple and advanced search forms for finding experiments based on internal PRIDE accession numbers, protein accession strings from large protein databases such as UniProtKB IDs or NCBI gi numbers, peptide sequences, references or controlled vocabulary annotations such as Gene Ontology or the Cell Type Ontology. More advanced searching of public PRIDE data may be accomplished using the PRIDE BioMart interface, which searches a specially optimized and cached snapshot of the database that is updated weekly. This interface allows for much more highly customized and flexible searches than the standard search interfaces. An example PRIDE experiment view is shown in Figure 2. The experiment reference, contact information of the submitter, and links to both identification and spectrum details can be accessed from this page.
Once data are located, the user has the option to download the data as standards-compliant XML or to view the data within the PRIDE web interface. PRIDE provides several tools for comparing and contrasting the results from multiple mass spectrometry experiments. Additionally, there are basic web interfaces for viewing experimental metadata, protein lists, peptides, scores generated by analysis algorithms and mass spectra.
The Global Proteome Machine (GPM) is a portal to a proteomics database and open source software that was developed by Beavis Informatics. The system was developed to allow research scientists the ability to use its proteomics data and tools to interrogate a number of proteomes. At its core is a publicly accessible open source search engine named X! Tandem that identifies peptides and proteins from tandem mass spectra. The X! Tandem search engine is accessible via a number of GPM mirror sites. Due to its free availability, speed, and online web interface, it has quickly established itself as a popular search engine for the proteomics community. Results of searches performed on the GPM sites are stored and regularly collated to the Global Proteome Machine Database (GPMDB) central repository, enabling the GPMDB to be a source for a large and diverse collection of tandem mass spectra and associated peptide and protein identifications. This expansive collection of proteomics data in GPMDB, including high confidence peptide identifications and their corresponding experimental tandem mass spectra, is a valuable resource for further MS computational research. The GPMDB can be accessed at http://gpmdb.thegpm.org/.
One application of this resource is in the study of proteotypic peptides. Proteotypic peptides are defined as those that ionize and fragment well such that they are successfully identified in tandem mass spectrometry experiments. The ability to confidently predict a priori which peptides for any given protein have a high probability of being identified is valuable at many levels. In a targeted mass spectrometry experiment, the ability to select specific peptides to target from the proteins of interest can increase throughput and experimental success rate. Algorithms that attempt to determine if peptides are proteotypic for a given proteome require a large collection of successful identifications across a diverse number of experiments such as those available in the GPMDB.[28–31] These previously identified peptides can also provide additional confidence to new spectrum matches, such as displayed in the GPMDB’s protein coverage map for protein SDS22/YKL193C shown in Figure 3. Consistently identified peptides, as shown in the coverage map, have a high likelihood of not being spurious hits.
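One simple empirical proxy for how proteotypic a peptide is can be computed directly from a collection of identifications: the fraction of experiments detecting a protein in which a given peptide of that protein was also detected. The sketch below illustrates only that counting step. It is not one of the cited prediction algorithms, and the function name and tuple layout are assumptions made for the example.

```python
from collections import defaultdict


def observation_frequency(identifications):
    """identifications: iterable of (experiment_id, protein, peptide) tuples.

    Returns {(protein, peptide): fraction of the protein's experiments
    in which that peptide was identified}.
    """
    experiments_by_protein = defaultdict(set)
    experiments_by_peptide = defaultdict(set)
    for experiment, protein, peptide in identifications:
        experiments_by_protein[protein].add(experiment)
        experiments_by_peptide[(protein, peptide)].add(experiment)
    return {
        key: len(experiments) / len(experiments_by_protein[key[0]])
        for key, experiments in experiments_by_peptide.items()
    }
```

Peptides with a frequency near 1 are the consistently identified ones; a new match to such a peptide is less likely to be spurious than a match to a peptide never seen before.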
One application that directly takes advantage of these proteotypic peptides is the Proteotypic Peptide Profiling or P3 search tool at the GPM. Proteotypic peptides are collated into a sequence database file and the P3 search tool allows users to query tandem mass spectra against those databases. Full protein sequences for those peptides that are confidently identified are then queried in subsequent, integrated refinement passes. This allows for a very rapid directed search as the initial search space is small yet the subsequent search passes allow for additional peptides and post translational modifications to be identified from the full protein sequences found in the initial search.
A system that enables the collection of a large number of identifications and corresponding mass spectra offers other unique analytical opportunities. The tandem mass spectra stored in GPMDB are a rich dataset for computational researchers interested in investigating fragmentation or bond cleavage rules.[33–36] Progress has been made in this domain but further advances will lead to improved identification and validation tools for sequence to spectrum matches. A second prominent application for the experimental data stored in GPMDB is the generation of mass spectral libraries and associated library search tools. Mass spectral library searching has long been employed by analytical chemists to infer structure of unknown species.[37–40] The application of library searching to tandem mass spectra derived from collision induced dissociation of peptides is attractive because the ability to predict or infer intensities of fragment ions from a peptide sequence is currently still imperfect. A spectrum to spectrum comparison is not dependent on any understanding of fragmentation rules and typically returns more sensitive matches compared to sequence to spectrum comparisons. A critical component for the successful application of this method to tandem mass spectral analysis is the ability to generate comprehensive spectral libraries that would not be possible without a centralized mechanism for performing and storing a large number of tandem mass spectral analyses such as offered by the GPM and GPMDB. Additional details on spectral libraries and library search tools are described later.
The PeptideAtlas project was initiated to map high confidence peptide identifications to eukaryotic genomes as one component of a resource for proteomics information.[41–43] The project annotates genome sequences of multiple organisms with peptide and protein information derived primarily from tandem mass spectrometry data. It also contains a growing set of software tools and underlying infrastructure for the analysis and visualization of the proteomics data that it encapsulates.
From a user perspective, the main PeptideAtlas web interface allows one to query for peptides or proteins of interest against a particular PeptideAtlas organism build or, more generally, against all the underlying data in the repository. Peptide search results are displayed on a page with associated summary statistics. These include information such as the total number of times a peptide has been observed, the experimental samples in which the peptide has been identified, and a histogram of the number of identifications from each experimental sample. An example of this information is displayed in Figure 4 for peptide ICDFGLAR (SMK1/YPR054W). Also present are links to annotated individual and consensus tandem mass spectra, an Ensembl browser view, and links to the individual experiments. Correspondingly, the results of a protein query yield a rich display of peptide observations. These include a layout of the observed peptides within the full length sequence, a list of distinct observed peptides for the protein, any available annotated SRM peptide transitions, and per experiment peptide expression abundance information.
There are currently PeptideAtlas builds for 7 organisms. For S. cerevisiae, the current Yeast PeptideAtlas build contains data from 56 experiments totaling over 57,000 distinct peptides (or over 39,000 distinct peptides that have been observed more than once). At the time the Yeast PeptideAtlas was published in late 2006, the repository contained peptide observations that aligned to 61% of the S. cerevisiae open reading frames (ORFs) and 76% of the ORFs with gene names. These results are generated by processing all raw data through a uniform analysis pipeline, which consists of converting the data to an open XML format, searching a sequence database, and applying a significance cutoff using empirically derived probabilities from PeptideProphet. The end product, derived from a diverse source of experimental data, can be used to explore, analyze and validate the yeast proteome.
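The last step of the pipeline, and the two peptide counts quoted for the build, can be illustrated with a short sketch. It assumes each peptide-spectrum match arrives as a `(peptide, probability)` pair; the function name and the 0.9 default cutoff are placeholders chosen for the example, not the threshold PeptideAtlas actually uses.

```python
from collections import Counter


def summarize_build(psms, min_probability=0.9):
    """psms: iterable of (peptide_sequence, probability) pairs.

    Returns (distinct peptides passing the cutoff,
             distinct peptides observed more than once).
    """
    counts = Counter(
        peptide for peptide, probability in psms
        if probability >= min_probability
    )
    distinct = len(counts)
    observed_more_than_once = sum(1 for n in counts.values() if n > 1)
    return distinct, observed_more_than_once
```

Reporting both numbers is useful because peptides seen only once are the most likely to be residual false positives even after the cutoff is applied.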
One of the many applications of the Yeast PeptideAtlas is as a resource for planning a quantitative MS experiment. Given a protein or complex of interest, one could query the atlas to identify those peptides that have been previously identified in an MS experiment, isolate those that are unique or specific to a single protein, and determine which of the peptides contain the amino acids that are required for a given type of isotopic labeling reagent. PeptideAtlas also serves as a large data repository from which one can retrieve raw data, search hits, and ProteinProphet results for the many public datasets it holds. The PeptideAtlas project can be accessed at http://www.peptideatlas.org/.
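The peptide selection steps for a quantitative experiment can be sketched as a pair of filters. The example assumes the previously observed peptides have already been retrieved as a mapping from peptide sequence to the set of proteins it matches; the function name, the mapping layout, and the toy sequences are hypothetical. The residue requirement is left as a parameter, for instance cysteine for a cysteine-reactive labeling reagent.

```python
def candidate_peptides(observed, target_protein, required_residues):
    """observed: dict mapping peptide sequence -> set of proteins it maps to.

    Keeps peptides that map only to the target protein and contain at least
    one of the residues the labeling reagent needs.
    """
    return sorted(
        peptide for peptide, proteins in observed.items()
        if proteins == {target_protein}
        and any(residue in peptide for residue in required_residues)
    )


# Made-up sequences and protein names, for illustration only.
observed = {
    "ACDEK": {"PROT_A"},            # unique, contains cysteine
    "GGLLK": {"PROT_A"},            # unique, no cysteine
    "MCSTR": {"PROT_A", "PROT_B"},  # contains cysteine, but shared
}
print(candidate_peptides(observed, "PROT_A", "C"))
```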
Proteomics studies generate enormous volumes of highly technical and complex experimental data. Laboratories wishing to make raw proteomics data available to collaborators, or in conjunction with research publications, must address many basic issues associated with the storage and distribution of large and complex datasets. At the minimum, they typically must buy new hardware, including sufficient disk space and servers necessary to host a web site that provides an interface to the data. Tranche, originally developed as a graduate student project and as part of the National Resource for Proteomics and Pathways at the University of Michigan, is meant to address these issues and greatly simplify the process of dissemination of large sets of experimental proteomics data. In fact, repositories such as PRIDE, PeptideAtlas, and Human Proteinpedia are beginning to interface with Tranche as the preferred mechanism for storing and disseminating large MS data files. Tranche can be accessed from the ProteomeCommons website at https://proteomecommons.org/tranche/.
Tranche is a distributed storage platform that, as of this writing, consists of fifteen online servers, each contributing storage to the network. Data uploaded to the Tranche network are divided into discrete units that are distributed across multiple servers. Each unit of data is stored on at least three separate servers to help ensure fault tolerance if individual servers fail. Tranche also allows third-party research labs to increase the overall capacity of the network by adding their own storage servers.
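The storage scheme can be sketched in a few lines: cut the data into fixed-size units and place each unit on three distinct servers. This is an illustration of the general idea only. The round-robin placement, the tiny chunk size, and the function names are assumptions made for the example and are not taken from the Tranche software.

```python
def place_chunks(data, servers, chunk_size=4, replicas=3):
    """Split data into chunks and assign each chunk to `replicas` distinct servers."""
    if replicas > len(servers):
        raise ValueError("not enough servers for the requested replication")
    placement = []
    for index, start in enumerate(range(0, len(data), chunk_size)):
        chunk = data[start:start + chunk_size]
        # Round-robin assignment (an assumption); consecutive servers are distinct.
        hosts = [servers[(index + k) % len(servers)] for k in range(replicas)]
        placement.append((chunk, hosts))
    return placement


def survives_failure(placement, failed_servers):
    """True if every chunk still has at least one copy on a working server."""
    return all(
        any(host not in failed_servers for host in hosts)
        for _, hosts in placement
    )
```

With three copies of every unit, the loss of any two servers leaves all data recoverable, which is the fault tolerance property described above.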
Access to uploading or downloading data in Tranche is provided via a Java Web Start application launched by clicking a hyperlink within the Tranche web site, displayed in Figure 5. Once the application is loaded, users may browse and download data by project or search for specific data by hash codes assigned to data by Tranche when data is uploaded to the network. Tranche provides an array of access control and licensing options for uploaded data. Data may be publicly available to all users of Tranche, encrypted and protected by passphrases or restricted by custom license definitions. The distributed nature of storage and ability for users to upload and control access to data of nearly any size or complexity effectively solves the issues associated with storing and distributing large data files associated with MS experiments. Users may share data with collaborators and distribute data in conjunction with publications simply by establishing a project and uploading data.
All software developed for Tranche is open source and available for download from the Tranche web site. Instructions are provided for installing and managing private installations of Tranche networks, providing another optional layer of security and flexibility for researchers wishing to distribute their proteomics data. As proteomics experiments continue to grow and as publication standards are widely adopted and strictly enforced, mechanisms such as Tranche will become invaluable to support the exchange of the large proteomics datasets.
With the large scale collection of tandem mass spectra and their associated identifications being collated in proteomics laboratories and public repositories, collections of high quality MS/MS reference spectra of peptides are being generated. These tandem mass spectral libraries are being used for library search algorithms as well as for targeted analysis. Although software tools are now available for individual research laboratories to generate their own spectral libraries, most academic research labs do not have the breadth of data and/or the computational infrastructure to generate large and comprehensive libraries that central repositories, such as the GPM or PeptideAtlas, are able to.
Large collections of tandem mass spectral libraries are available from the GPM, NIST, SpectraST, and BiblioSpec projects. Although they all implement the same concept, each of the MS/MS spectral libraries has its own unique qualities. The GPM has focused on a minimalistic approach for data storage and thus its MS/MS libraries store just the 20 most intense peaks in a reference spectrum. Referred to as an Annotated Spectrum Library (ASL), tandem mass spectra of confident peptide assignments are extracted from the GPMDB and averaged together to generate reference spectra that can be searched using the X! Hunter library search tool. Additional information on X! Hunter and the corresponding mass spectral libraries of 16 different species can be found at http://www.thegpm.org/hunter/. NIST generates high quality reference libraries of peptide mass spectra using identifications from four different search engines. Consensus spectra are generated by identifying spectra with the highest sequence database search scores, aligning m/z values, selecting spectra that cluster similarly, and determining intensity values using a weighted average from the input spectra. The NIST libraries, currently available for 6 different species, can be accessed at http://peptide.nist.gov/. The SpectraST libraries are typically based on processing data through the Trans-Proteomic Pipeline (TPP) for optimal identifications and consistent statistical analysis. Although SpectraST itself can be used as a stand-alone tool, one of its advantages is that it is seamlessly integrated into the TPP software suite so that searches can be invoked and viewed in a common interface. The SpectraST libraries as well as the NIST libraries in SpectraST format can be accessed from http://www.peptideatlas.org/speclib/. Lastly, the BiblioSpec libraries are generated based on data acquired and analyzed primarily within a single high throughput MS research lab. The BiblioSpec library generation pipeline chooses a single ‘best’ spectrum rather than generating a consensus spectrum, because this gives the best performance with the BiblioSpec score function. The BiblioSpec libraries for S. cerevisiae, E. coli, and C. elegans can be downloaded from http://proteome.gs.washington.edu/software/bibliospec/.
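The minimalistic storage approach described for the GPM libraries, keeping only the 20 most intense peaks of a reference spectrum, is easy to illustrate. The sketch below assumes a spectrum is a list of `(m/z, intensity)` pairs; the function name is hypothetical and the code is not taken from the GPM software.

```python
def keep_most_intense(peaks, n=20):
    """Reduce a spectrum to its n most intense peaks, returned in m/z order.

    peaks: list of (mz, intensity) tuples.
    """
    most_intense = sorted(peaks, key=lambda peak: peak[1], reverse=True)[:n]
    return sorted(most_intense)
```

The trade-off is a much smaller library at the cost of discarding low-intensity fragment ions that might otherwise contribute to a match.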
When applied to the peptide identification problem, MS/MS based spectral library searching has distinct advantages over sequence database searching. Because actual MS/MS spectra are being compared against each other, the analysis results in more sensitive identifications than sequence database searching. This is due to the fact that fragment ion intensities do not have to be predicted from the primary sequence. A spectrum to spectrum comparison innately accounts for the intensity distribution of fragment ion peaks that would otherwise need to be predicted by sequence database search algorithms. From a throughput standpoint, spectral library search algorithms run much faster than their sequence search counterparts. MS/MS library search tools skip the computational expense associated with protein sequence digestion, peptide mass calculations, and fragment ion calculations. In addition, the search space is limited by the size of the spectral libraries. Even with the large scale collection of data feeding into these reference libraries, the libraries are much smaller than the search space of the corresponding sequence database searches.
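A spectrum to spectrum comparison can be illustrated with a normalized dot product over binned peaks, a generic similarity measure for spectra. This is a sketch under stated assumptions: the 1.0 m/z bin width and the function names are arbitrary choices for the example, and the scoring functions actually used by X! Hunter, SpectraST, and BiblioSpec differ in their details.

```python
import math
from collections import defaultdict


def bin_spectrum(peaks, bin_width=1.0):
    """Sum peak intensities into fixed-width m/z bins."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz / bin_width)] += intensity
    return binned


def spectral_similarity(query, reference, bin_width=1.0):
    """Normalized dot product between two spectra; 1.0 means identical shape."""
    q = bin_spectrum(query, bin_width)
    r = bin_spectrum(reference, bin_width)
    dot = sum(q[b] * r[b] for b in q.keys() & r.keys())
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in r.values())))
    return dot / norm if norm else 0.0
```

No digestion, mass calculation, or fragment prediction appears anywhere in the comparison, which is the source of the speed advantage noted above: the cost per candidate is a single pass over two short peak lists.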
There are a few potential drawbacks to spectral library searching that should be noted. The first is that, just as sequence database searching cannot identify a protein whose sequence is absent from the database being searched, spectral library searching is limited to the peptide identifications that are in the library. Novel peptides or modifications cannot be identified if they are not present in the library being queried. This will be less of an issue over time as increasingly comprehensive tandem mass spectrometry data are collected in the central repositories. Another concern is that if a reference library spectrum is based on an incorrect peptide assignment, all subsequent library matches to that spectrum will also be wrong. Careful and sophisticated library building algorithms are required to minimize this issue. Lastly, it is unclear how effective it is to search tandem mass spectra acquired on one type of instrument, say a quadrupole time of flight, against a library composed of data acquired on a different instrument, e.g. an ion trap. It is also unclear whether there are any significant negative effects of searching data acquired with modified instrument settings. As tandem mass spectral library searching becomes a more commonly accepted practice, it is expected that all of these issues will be addressed and resolved.
The collection of high quality spectral libraries can be an important resource for targeted analysis, such as Selected Reaction Monitoring (SRM) method generation and analysis. Determining which peptide and fragment ions (Q1 and Q3 masses, respectively) to monitor in an SRM experiment is greatly aided by using information in the libraries. Peptides that ionize and fragment well, such as the proteotypic peptides represented in these MS/MS libraries, are good targets for SRM experiments. Because the actual fragmentation spectra are known and generally consistent across instruments, one can choose to monitor those Q3 transitions that correspond to intense and unique fragment ion peaks in the library spectrum. Lastly, the relative intensity of transition peaks that are monitored can be compared against their expected intensities from the spectral library. This rank order and expected abundance ratios of the fragment ions can be important parameters that are used by SRM analysis tools to score the likelihood that the chromatographic signals of the monitored transitions are indicators of the peptide’s presence.
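The two uses of a library spectrum described above, choosing transitions and checking rank order, can be sketched as follows. The example assumes the library spectrum has already been annotated as `(label, m/z, intensity)` fragment tuples; the function names are hypothetical, and the sketch selects on intensity alone, ignoring the uniqueness criterion for brevity.

```python
def pick_transitions(precursor_mz, fragments, k=3):
    """Return the k most intense fragments as (Q1, Q3, label) transitions.

    fragments: list of (label, mz, intensity) tuples from a library spectrum.
    """
    ranked = sorted(fragments, key=lambda f: f[2], reverse=True)[:k]
    return [(precursor_mz, mz, label) for label, mz, _ in ranked]


def rank_order_matches(library_intensities, observed_intensities):
    """Compare the intensity rank order of monitored transitions.

    Both arguments map a fragment label to an intensity.
    """
    def order(intensities):
        return sorted(intensities, key=intensities.get, reverse=True)
    return order(library_intensities) == order(observed_intensities)
```

A real SRM scoring tool would typically use a graded measure of agreement rather than an exact rank match, but the exact match shows the principle: observed transition intensities that follow the library ordering support the presence of the peptide.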
In this review, we have covered only some of the many proteomics data resources available to researchers, with a particular focus on relevance to yeast researchers. As illustrated here, there already exists an enormous amount of mass spectrometry data in the public domain, and the rate at which data are added is only increasing. The development of new technologies and analysis algorithms, as well as the application of these technologies to entirely new areas of medical and biological research, will only increase the complexity of the data and the need for proteomics data repositories that can serve an increasingly diverse audience. As these proteomics repositories grow and mature, and as innovative new resources appear, they will become progressively more indispensable to the research community by housing a more comprehensive view of proteins and whole proteomes. They should help to prevent redundant experiments by providing experimental results to researchers considering similar experiments, provide ample “raw” data for training analysis algorithms of many types, and even serve as a platform for biological research by providing protein annotations to students and biologists interested in discovering what is known about proteins of interest.
To meet these goals, proteomics repositories will likely need to focus on several key areas. In order to increase adoption by researchers wishing to upload and download data, the repositories will need to more universally support industry data standards, such as mzML, during the upload and download process. They will need to work to simplify the data submission interface and enforce the inclusion of accurate and comprehensive experimental metadata, which are essential for providing context to other researchers. To facilitate meaningful biological discovery, the repositories will need to improve the accessibility of data, such as through the development of web services that allow direct access by third party biological software. They will need to integrate proteomics data with data from other databases and develop tools that help provide scientists a sense of how these data fit into the current picture of what is known about specific proteins or organisms.
This work is supported by the University of Washington's Proteomics Resource (UWPR95794) and the National Center for Research Resources of the National Institutes of Health (P41 RR11823).