Policies supporting the rapid and open sharing of genomic data have directly fueled the accelerated pace of discovery in large-scale genomics research. The proteomics community is starting to implement analogous policies and infrastructure for making large-scale proteomics data widely available on a pre-competitive basis. On August 14, 2008, the National Cancer Institute (NCI) convened the “International Summit on Proteomics Data Release and Sharing Policy” in Amsterdam, the Netherlands, to identify and address potential roadblocks to rapid and open access to data.
The six principles agreed upon by key stakeholders at the summit addressed issues surrounding 1) timing, 2) comprehensiveness, 3) format, 4) deposition to repositories, 5) quality metrics, and 6) responsibility for proteomics data release. This summit report explores various approaches to develop a framework of data release and sharing principles that will most effectively fulfill the needs of the funding agencies and the research community.
On August 14, 2008, members of the international proteomics community met for a one-day summit in Amsterdam, the Netherlands convened by the National Cancer Institute (NCI) of the U.S. National Institutes of Health (NIH).1 This summit was undertaken to address what is seen as a considerable obstacle to accelerating the pace of discovery in proteomic research: the lack of widely followed policies governing the rapid release of large-scale proteomic data into the public domain, taking into account data quality, standards and integration, intellectual property, ethics, and sustainability (ensuring that there are incentives for the creation of data sets).
Rapid public data release has long been standard practice within the large-scale genomics community. It also is standard practice for the field of macromolecular structure determination. It is widely felt that this practice—made possible by the existence of universally endorsed policies governing the standards for and the availability of data in the public domain, as well as centralized repositories and portals for depositing and accessing such data—has been a driver of the rapid pace of genomic discovery. The proteomics community would benefit greatly from adopting an appropriately similar practice.
The proteomics community was well represented at this focused meeting. Attendees included data producers, data users, database repositories, scientific journals (Journal of Proteome Research, Molecular and Cellular Proteomics, Nature Biotech, PROTEOMICS, and PROTEOMICS - Clinical Applications), and funding agencies (the National Cancer Institute, the Department of Energy, the European Commission, the Wellcome Trust, and Genome Canada).
This post-summit document discusses the basic principles underlying data release in genomic research and the challenges to developing similar principles in the proteomics domain, and synthesizes the data release and sharing principles proposed by the Amsterdam summit attendees.
The data sharing policies of the U.S. National Human Genome Research Institute (NHGRI) and other genomic research funding bodies (i.e., those that were engaged in the International Human Genome Sequencing Consortium) are derived from a series of principles discussed and agreed upon at the First International Strategy Meeting on Human Genome Sequencing, held in Bermuda in 1996.2 Called the “Bermuda Principles,” these guidelines were intended to apply to “all human genomic sequences generated by large-scale sequencing centers, funded for the public good, in order to prevent such centers establishing a privileged position in the exploitation and control of human sequence information.” As such, they state that genomic sequences should be “freely available and in the public domain as soon as possible in order to encourage research and development, and to maximize its benefit to society.” In addition, the principles also called on genomic research centers to release sequence assemblies as soon as possible and to submit finished annotated sequences to public databases immediately.
Knowing that the availability of high quality data was of supreme importance to the success of the Human Genome Project and to subsequent efforts to translate the knowledge it generated, the Bermuda Principles were expanded at the Second International Strategy Meeting on Human Genome Sequencing to include standards for sequence quality and suggested standards for sequence annotation.3 In addition, guidelines for scientific claims and etiquette were proposed so as to minimize conflict within the community regarding the rights of data producers and users.
The principles promulgated in Bermuda were reaffirmed at a meeting sponsored by the Wellcome Trust in 2003.4 Meeting attendees also further expanded upon the Bermuda Principles in two ways. First, it was agreed that the principles of pre-publication data release should be extended to “other types of data from other large-scale production centers specifically established as ‘community resource projects.’” Such projects were defined by the meeting participants as projects “specifically devised and implemented to create a set of data, reagents, or other material whose primary utility will be as a resource for the broad scientific community.”
Second, the meeting participants addressed conflicts between the interests of data producers—who desire to publish the first analyses of their own data—and those of data users—the members of the scientific community seeking rapid access to genomic data for further study. It was agreed that each of three core constituencies in large-scale biological research—data producers, data users, and funding agencies supporting and facilitating such research—shared responsibility for ensuring the growth and development of community resource projects while addressing each constituency’s interests.
The challenges to rapid proteomic data release can be divided into three categories: technical, infrastructural, and policy. Each category, however, impacts the other two. Therefore, in developing principles for proteomic data release, summit participants took a comprehensive approach, addressing each category as it applies to the overall issue of data release.
Such challenges stem from the variability that exists in nearly every aspect of proteomic data generation, interpretation, and presentation. Proteomic data can be generated using a long and growing list of experimental platforms. Mass spectra alone can be generated by MS, tandem MS, liquid chromatography-MS, and other methods. A single instrument platform can be used to produce more than one kind of data. Tandem MS, for instance, produces data on ionized peptides that can be divided into quantitative data and identification data.
Individual laboratories also tend to develop their own processes and procedures for equipment calibration; thus, mass spectra generated on the same instrument by two different laboratories using the same reagents may be incomparable because each lab calibrates its equipment differently. Also, no standardized sources exist for experimental reagents, adding to the difficulties of accurately comparing or replicating data generated across laboratories.
Raw, unprocessed data are considered to be the best and most accurate representation of an experiment’s results. However, numerous instruments have been developed for each platform, each of which produces raw data in a proprietary format developed by the instrument’s manufacturer. Thus, raw data can be difficult, if not impossible, to interpret or compare across laboratories unless 1) a data user has the same instrument and software package as the data producer; or 2) the data are converted to an open format. Data format standards, however, have not yet been widely agreed upon within the community: the community standard format mzML5 and its accompanying controlled vocabulary are still in the initial stages of broad community acceptance. There is also the risk that some information will be lost in converting data to an open format.
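To make the conversion issue concrete: open formats such as mzML store peak lists as base64-encoded binary arrays of IEEE floats rather than in vendor-specific binary records. The sketch below is illustrative only, not a full mzML reader or writer; the function names are our own, and it assumes 64-bit little-endian values, one of the encodings mzML's controlled vocabulary allows.

```python
import base64
import struct

def encode_peak_array(values):
    """Pack floats as little-endian 64-bit doubles and base64-encode them,
    in the style of an mzML binary data array (illustrative sketch)."""
    raw = struct.pack("<%dd" % len(values), *values)
    return base64.b64encode(raw).decode("ascii")

def decode_peak_array(text):
    """Invert encode_peak_array: base64-decode, then unpack the doubles."""
    raw = base64.b64decode(text)
    return list(struct.unpack("<%dd" % (len(raw) // 8), raw))

# Round trip: an open, documented encoding preserves the m/z values
# exactly, independent of the instrument vendor's software.
mzs = [445.12003, 445.45512, 446.78901]
assert decode_peak_array(encode_peak_array(mzs)) == mzs
```

Because both the byte order and the encoding are spelled out in the open specification, any laboratory can decode the peak list without the original vendor software, which is precisely the interoperability benefit conversion is meant to deliver.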
In addition to variations in data generation, a variety of analytical options exist for peptide identification (e.g., identification based on peptide mass and retention time, comparison of fragmentation data to theoretical fragmentation, or comparison of fragmentation data to previously observed spectra stored in a spectral library) and quantitation (e.g., spectral counting, quantitation from MS peak signal intensity or peak area, quantitation from MS/MS fragment signal intensity). Some half-dozen protein sequence databases are available for searching, each varying in its level of completeness and redundancy. While search engine scoring schemes have made tremendous gains, peptide identification confidence scores can still be influenced by a number of factors.
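As an illustration of the simplest of these quantitation options, spectral counting tallies the number of peptide-spectrum matches (PSMs) assigned to each protein. The sketch below assumes a hypothetical, already-filtered PSM list; the scan identifiers and accessions are invented for illustration.

```python
from collections import Counter

def spectral_counts(psms):
    """Tally peptide-spectrum matches per protein accession.

    `psms` is an iterable of (spectrum_id, protein_accession) pairs,
    as might be exported from a search engine after filtering.
    """
    return Counter(protein for _, protein in psms)

# Hypothetical filtered identifications, for illustration only.
psms = [
    ("scan_0001", "P12345"),
    ("scan_0002", "P12345"),
    ("scan_0003", "Q67890"),
]
counts = spectral_counts(psms)  # Counter({'P12345': 2, 'Q67890': 1})
```

Even this trivial method depends on the upstream identification step, which is why the choice of search engine, sequence database, and score threshold can all change the resulting quantitation.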
The infrastructure for public deposition of proteomic data is evolving. In any given field, multiple repositories often arise more or less simultaneously. For instance, genomics researchers have a number of options for where to deposit sequence data (e.g., GenBank, EMBL, DDBJ). The availability of multiple repositories can benefit the scientific community: through collaboration and data sharing, repositories can increase coverage, reduce duplication of effort, and gain some measure of security (e.g., data redundancy in the event of database failure or closure). This is already done to some extent: data deposited with the U.S. National Center for Biotechnology Information (NCBI) are mirrored at the European Bioinformatics Institute (EBI) and the DNA Data Bank of Japan (DDBJ), and vice versa, to ensure long-term security.
While a few public repositories for proteomic data do exist (e.g., GPMDB, UniProtKB, PeptideAtlas, PRIDE, and the newly formed NCBI Peptidome), they differ in the formats and kinds of data deposited (structural versus sequence, raw versus processed, uncurated versus curated). Tranche appears to be the primary mechanism that can serve the field as a repository for storing and serving raw data files. No international or centralized network of repositories capable of reinforcing one another has yet emerged, although ProteomeXchange, still in its infancy, is one promising effort.6 Also, because each repository uses a different format, researchers may not be able to access all desired information from any single repository, and they may have to learn an entirely different system for accessing information from each.
Questions have long existed as to who holds the responsibility for setting and enforcing guidelines within the proteomics community, including guidelines for the submission of data for publication and standard metrics for assessing the quality of proteomic data (e.g., MS, protein affinity arrays) submitted for release. Such guidelines are necessary to ensure that enough information is provided to the community to explain an experiment, provide an assessment of the reliability of the data, and provide the data that support the results.
Currently, proteomics journals each develop their own guidelines for data submission. These guidelines can differ greatly in scope and stringency. The journal Molecular & Cellular Proteomics, for instance, has developed a set of publication guidelines based on discussions held in Paris in 2005.7 These “Paris Guidelines” address the publication of protein sequence, quantitation, and post-translational modification data, but they have not been broadly adopted. Some journals encourage authors to adhere to these guidelines, while other journals such as PROTEOMICS have produced their own.8
In addition, the standards for deposition of data in centralized repositories are still evolving. For example, the HUPO Proteomics Standards Initiative (HUPO-PSI) has published standards for proteomic data representation, specifically for MS and protein-protein interaction studies, including minimal reporting requirements, standard formats, common sets of controlled vocabularies and/or ontologies, annotations, and validation guidelines.9 These standards, though, are not yet universally accepted.
The principles agreed upon at the Amsterdam summit are intended to address the major items required for the development of a useful and successful proteomic data release policy and to account for the challenges noted above where possible. The principles agreed upon include:
The release of high quality data following standardized approaches would put the pace of proteomic research on a trajectory similar to that seen in large-scale genomics research. While numerous challenges remain in defining the policies and procedures for release of data into the public domain, the proteomics community is calling loudly for leading entities (e.g., funding agencies, journals, standards working groups, international societies) to produce the necessary guidelines. It is hoped that the Principles proposed herein will be considered and discussed by the community at large and will serve as a starting point for bringing proteomic data release practices in line with those of the genomics community as appropriate.
We thank Ruedi Aebersold and Anna D. Barker for input and advice in planning the International Summit on Proteomics Data Release and Sharing Policy.