On September 18, 2010, members of the international proteomics community met for a one-day workshop in Sydney, Australia, that was convened by the National Cancer Institute (NCI) of the U.S. National Institutes of Health (NIH). This workshop was held to address the lack of widely implementable policies governing the adoption and use of quality metrics for open access proteomic data, particularly concerning tandem mass spectrometry data used to identify and quantify proteins based on fragmentation of peptide ions. Parallel efforts can be implemented for protein capture-based data.
Clear data policy has aided the advance of other data-driven fields of research. Data release policies for DNA sequencing have enabled widespread access to data from the Human Genome Project and other large-scale sequencing efforts.1 The development of Minimum Information About a Microarray Experiment (MIAME) defined reporting standards for microarray experiments.2 Researchers, funding agencies, and journals would all benefit from data policy clarification in proteomics as well.
This workshop was intended to further a process that, for proteomics, began in 2004. Editors of the journal Molecular and Cellular Proteomics (MCP) recognized the need for manuscript submission guidelines to help reviewers and readers understand the methods by which authors acquired, processed, and analyzed their data.3 This was followed by an expanded set of guidelines developed by an invited group of participants that met in Paris, France, in 2005.4 These guidelines delineate elements required in manuscripts detailing LC-MS/MS proteomic inventories (such as assignment of peptides, assessment of false positive rate, and inference of proteins) and other types of experiments (such as quantitative measurements, post-translational modification assignments, and peptide-mass fingerprinting). Soon after MCP adopted these “Paris Guidelines” as its standard, other journals began to adopt variants of them.5 A 2009 Philadelphia workshop revisited these guidelines to require, for the first time, deposition of raw data in a public repository.* In 2008, MCP developed the first set of guidelines for clinical proteomics papers to establish a baseline of credibility for articles in this field.6 Concomitantly, the Human Proteome Organization’s Proteomics Standards Initiative (HUPO-PSI) developed guidance for reporting data and metadata (MIAPE) and sought to standardize data formats for mass spectrometry data (mzML).7 These efforts provide a solid foundation for advancing proteomics data policy.
To complement the activities of MCP and HUPO-PSI, the NCI sponsored a 2008 International Summit at which the proteomics community produced the six Amsterdam Principles to promote open access to proteomic data.8 The principles address timing, comprehensiveness, format, deposition to repositories, quality metrics, and responsibility for proteomic data release. For its published articles, MCP has defined policies covering the principles of timing, deposition to repositories, and responsibility for proteomic data release. MIAPE and mzML provide solid frameworks for advancing the principles of comprehensiveness and format. The principle for quality metrics, however, has not yet been matched with community-derived, comprehensive guidance. The Amsterdam Principles state that “central repositories should develop threshold metrics for assessing data quality. These metrics should be developed in a coordinated manner, both with the research community and among each other, so as to ensure interoperability. As data become shared through such repositories, their value will become obvious to the community, and momentum will grow to sustain data release.”
The NCI invited key thought-leaders and stakeholders in proteomics to the Sydney Workshop, including data producers and users, managers of database repositories, editors of scientific journals, and representatives of funding agencies. These stakeholders were asked, “How good is good enough, when it comes to MS-based proteomic data?” The answers to this question would form the quality criteria for shared data, benefiting both producers and users of proteomic data.
The workshop focused on three use cases: 1) users of public proteomic data, 2) reviewers of journal articles, and 3) multi-site data production projects in which unpublished data are shared among laboratories. In the first case, public proteomic data may be used for a variety of applications, each with its own prerequisites for data quality and metadata. What level of annotation and which quality metrics would help a potential user quickly determine whether a data set is appropriate for the intended application? In the second case, reviewers need to determine the quality of the data that accompany a submitted manuscript in order to judge how strongly the data support the experimental findings in the manuscript. Short of repeating the full analysis of the data, what set of metrics would facilitate the task of reviewers? In the third case, team science in the context of systems biology presents opportunities and challenges for integrating data sets generated by different labs with distinct quality assurance protocols, often on different analysis platforms. Participants in the workshop stated that standards and metrics must not be so oppressive that they stifle innovation and methods development and potentially hinder the evolution of proteomic technologies. However, for data produced to answer a biological question rather than to demonstrate technology, what metrics would bolster confidence that data are of sufficient quality to be integrated with experimental results spanning multiple labs and platforms?
Data from proteomics experiments can assume a variety of forms, depending on the experimental design. Evaluating data set quality depends heavily on the accompanying annotation. Metadata relate the experimental design to the data produced, making clear which files or records are associated with a particular sample. Protocols may be documented via Standard Operating Procedures, but most data sets presently depend on methods explained briefly or by reference in associated publications. Instrument data are generally considered the core of public data sets, but the format in which they are published may have a considerable impact on their usability. Information derived from these data may include raw peptide identifications or transition peak areas. Often this level is bypassed, and a more heavily processed protein spectral count table or protein ratio table is presented with minimal explanation of the software tools used to move from instrument data to derived information. Finally, experimental findings reflect the high-level analysis motivated by the experimental design. A list of differentially expressed proteins, enriched networks, or biomarker candidates connects to derived information through higher-level analyses conducted in statistical or systems biology tools.
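The levels of a data set described above (metadata, instrument data, derived information, and experimental findings) can be pictured as a simple record with consistency checks between levels. The following Python sketch is illustrative only: the class `DataSet`, its field names, the helper functions, and all file names and peptide strings are hypothetical placeholders, not part of any repository schema, MIAPE module, or workshop recommendation.

```python
from dataclasses import dataclass, field


@dataclass
class DataSet:
    """Hypothetical container for the four levels of a shared data set."""
    # Metadata: sample name -> instrument data files acquired from that sample.
    sample_files: dict[str, list[str]] = field(default_factory=dict)
    # Instrument data: every instrument file included in the data set.
    instrument_files: list[str] = field(default_factory=list)
    # Derived information: instrument file -> peptide identifications.
    peptide_ids: dict[str, list[str]] = field(default_factory=dict)
    # Experimental findings: e.g. names of differentially expressed proteins.
    findings: list[str] = field(default_factory=list)


def unannotated_files(ds: DataSet) -> list[str]:
    """Instrument files that the metadata do not associate with any sample."""
    annotated = {f for files in ds.sample_files.values() for f in files}
    return sorted(f for f in ds.instrument_files if f not in annotated)


def files_without_derived_information(ds: DataSet) -> list[str]:
    """Instrument files for which no derived information was provided."""
    return sorted(f for f in ds.instrument_files if f not in ds.peptide_ids)


# Made-up example: three instrument files, one of them not linked to a sample.
demo = DataSet(
    sample_files={"sample_A": ["run01.mzML"], "sample_B": ["run02.mzML"]},
    instrument_files=["run01.mzML", "run02.mzML", "run03.mzML"],
    peptide_ids={"run01.mzML": ["PEPTIDEK"]},
)
print(unannotated_files(demo))
print(files_without_derived_information(demo))
```

Checks of this kind test only whether the levels are linked to one another; they say nothing about whether the data themselves are of sufficient quality, which is the question the workshop set out to address.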
This meeting report discusses the basic principles underlying data quality metrics in proteomics and the challenges of developing and implementing these principles. Its purpose is to serve as a focus of discussion and action that will guide the continuing commitment of the field to conduct high-quality research and to make the product of that research useful to the global research community.