Policies supporting the rapid and open sharing of proteomic data are being implemented by the leading journals in the field. The proteomics community is taking steps to ensure that data are made publicly accessible and are of high quality, a challenging task that requires the development and deployment of methods for measuring and documenting data quality metrics. On September 18, 2010, the U.S. National Cancer Institute (NCI) convened the “International Workshop on Proteomic Data Quality Metrics” in Sydney, Australia, to identify and address issues facing the development and use of such methods for open access proteomics data. The stakeholders at the workshop enumerated the key principles underlying a framework for data quality assessment in mass spectrometry data that will meet the needs of the research community, journals, funding agencies, and data repositories. Attendees discussed and agreed upon two primary needs for the wide use of quality metrics: (1) an evolving list of comprehensive quality metrics and (2) standards accompanied by software analytics. Attendees stressed the importance of increased education and training programs to promote reliable protocols in proteomics. This workshop report explores the historic precedents, key discussions, and necessary next steps to enhance the quality of open access data.
By agreement, this article is published simultaneously in the Journal of Proteome Research, Molecular and Cellular Proteomics, Proteomics, and Proteomics Clinical Applications as a public service to the research community. The peer review process was a coordinated effort conducted by a panel of referees selected by the journals.
On September 18, 2010, members of the international proteomics community met for a one-day workshop in Sydney, Australia, that was convened by the National Cancer Institute (NCI) of the U.S. National Institutes of Health (NIH). This workshop was held to address the lack of widely implementable policies governing the adoption and use of quality metrics for open access proteomic data, particularly concerning tandem mass spectrometry data used to identify and quantify proteins based on fragmentation of peptide ions. Parallel efforts can be implemented for protein capture-based data.
Clear data policy has aided the advance of other data-driven fields of research. Data release policies for DNA sequencing have enabled widespread access to data from the Human Genome Project and other large-scale sequencing efforts.1 The development of Minimum Information About a Microarray Experiment (MIAME) defined reporting standards for microarray experiments.2 Researchers, funding agencies, and journals would all benefit from data policy clarification in proteomics as well.
This workshop was intended to further a process that, for proteomics, began in 2004. Editors of the journal Molecular and Cellular Proteomics (MCP) recognized the need for manuscript submission guidelines to help reviewers and readers understand the methods by which the authors acquired, processed, and analyzed their data.3 This was followed by an expanded set of guidelines developed by an invited group of participants that met in Paris, France, in 2005.4 These guidelines delineate elements required in manuscripts detailing LC-MS/MS proteomic inventories (such as assignment of peptides, assessment of false positive rate, and inference of proteins) and other types of experiments (such as quantitative measurements, post-translational modification assignments, and peptide-mass fingerprinting). Soon after MCP adopted these “Paris Guidelines” as its standard, other journals began to adopt variants of them.5 A 2009 Philadelphia workshop revisited these guidelines to require, for the first time, deposition of raw data in a public repository.* In 2008, MCP developed the first set of guidelines for clinical proteomics papers to establish a baseline of credibility for articles in this field.6 Concomitantly, the Human Proteome Organization’s Proteomics Standards Initiative (HUPO-PSI) developed guidance for reporting data and metadata (MIAPE) and sought to standardize data formats for mass spectrometry data (mzML).7 These efforts provide a solid foundation for advancing proteomics data policy.
To complement the activities of MCP and HUPO-PSI, the NCI sponsored a 2008 International Summit at which the proteomics community produced the six Amsterdam Principles to promote open access to proteomic data8 addressing issues of: timing, comprehensiveness, format, deposition to repositories, quality metrics, and responsibility for proteomic data release. MCP has defined policies for its published articles for the principles involving timing, deposition to repositories, and responsibility for proteomic data release. MIAPE and mzML provide solid frameworks for advancing the principles of comprehensiveness and format. The principle for quality metrics, however, has not yet been matched with community-derived, comprehensive guidance. The Amsterdam Principles state that “central repositories should develop threshold metrics for assessing data quality. These metrics should be developed in a coordinated manner, both with the research community and among each other, so as to ensure interoperability. As data become shared through such repositories, their value will become obvious to the community, and momentum will grow to sustain data release.”
The NCI invited key thought-leaders and stakeholders in proteomics to the Sydney Workshop, including data producers and users, managers of database repositories, editors of scientific journals, and representatives of funding agencies. These stakeholders were asked, “How good is good enough, when it comes to MS-based proteomic data?” The answers to this question would form the quality criteria for shared data, benefiting both producers and users of proteomic data.
The workshop focused on three use cases: 1) users of public proteomic data, 2) reviewers of journal articles, and 3) multi-site data production projects in which unpublished data are shared among laboratories. In the first case, public proteomic data may be used for a variety of applications, covering a range of prerequisites for data quality and metadata. What level of annotation and which quality metrics would help a potential user quickly determine if a data set is appropriate for the intended application? In the second case, reviewers need to determine the quality of data that accompany a submitted manuscript as a means to ascertain the strength by which the data support experimental findings in the manuscript. Apart from repeating the full analysis of the data, what set of metrics would facilitate the task of reviewers? Third, team science in the context of systems biology presents opportunities and challenges for integrating data sets generated by different labs with distinct quality assurance protocols, often on different analysis platforms. Participants in the workshop stated that standards and metrics must not be oppressive, stifling innovation and methods development and potentially hindering the evolution of proteomic technologies. However, for data produced to answer a biological question rather than to demonstrate technology, what metrics would bolster confidence that data are of sufficient quality to be integrated with experimental results spanning multiple labs and platforms?
Data from proteomics experiments can assume a variety of forms, depending on the experimental design. Evaluating data set quality is heavily dependent upon accompanying annotation. Metadata relate the experimental design to the produced data, making clear which files or records are associated with a particular sample. Protocols may be documented via Standard Operating Procedures, but most data sets are presently dependent on methods explained briefly or by reference in associated publications. Instrument data are generally considered the core of public data sets, but the format in which they are published may have a considerable impact on their usability. Derived information from these data may include raw peptide identifications or transition peak areas. Often this level is bypassed, with a more heavily processed protein spectral count table or protein ratio table presented with minimal explanation of the software tools used to move from instrument data to derived information. Finally, experimental findings reflect the high-level analysis motivated by the experimental design. A list of differentially-expressed proteins, enriched networks, or biomarker candidates connects to derived information through higher-level analyses conducted in statistical or systems biology tools.
This meeting report discusses the basic principles underlying data quality metrics in proteomics and the challenges to develop and implement these principles. Its purpose is to serve as a focus of discussion and action that will guide the continuing commitment of the field to conduct high quality research and to make the product of that research useful to the global research community.
One of the major conclusions from the Amsterdam Summit was that data sharing among members of the proteomics community should be required. While there is widespread consensus on the importance of making proteomics data publicly available, the full infrastructure to do so does not yet exist. ProteomeCommons/Tranche9 and ProteomeXchange10 have been developed to meet key needs, but do not yet constitute the comprehensive system required to fully support the field. Participants at the Sydney Workshop called for a sustainably-funded repository or set of linked repositories that would work closely with journals to ensure long-term access to proteomics data.
A major problem confronting the reuse of published data files is the lack of information concerning the experiment that led to data generation. Even when experimental annotation is available, it is often cryptic or missing key information, inviting errors in mapping data to results. Contacting authors is often necessary to seek clarification.
Given public access and annotation for a data set, users should next be able to assess its quality, preferably prior to download. Data quality has become more important due to rapid evolution of technologies as well as the growing size and quantity of repository data sets. Given the option of several potential data sets, users should be able to compare among them on a common basis and evaluate their comparability or compatibility.
With major repositories joining forces to create ProteomeXchange10, quality assurance and quality control of public data sets will be necessary to separate what is useful from the rest. Moreover, research societies and funding agencies are placing importance on proteomic technologies and findings with the launch of the HUPO Human Proteome Project11, the NCI-sponsored Clinical Proteomic Technologies for Cancer (CPTC) Initiative,12 and other related initiatives, raising the importance of both data access and quality. Data quality standards will enable tools for tracking samples, measuring analytical variation, flagging incomplete data sets, and protecting against experimental misinterpretation.
While the assembly of proteome inventories was the principal driver for technologies prior to the year 2000 and continues to be important as much more powerful instruments are deployed, the most recent decade has seen the emergence of quantitative proteomics as a complementary discipline. While both sets of technologies depend upon the dissociation of peptides into fragment ions, discovery technologies generally scan the fragments produced from sampled peptide ions while targeted strategies quantify fragment ions from specified sets of precursor ions. The ion chromatograms of these fragment intensities comprise the raw output of selected reaction monitoring (SRM, also commonly referred to as multiple reaction monitoring or MRM) experiments just as tandem mass spectra are the raw output of sampling dominant peptide precursor ions.
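To make the distinction concrete, the raw output of an SRM experiment is an extracted ion chromatogram per transition, and quantification typically reduces each chromatogram to a peak area. The sketch below, with purely illustrative retention times and intensities (not data from any study discussed here), shows one common way such an area can be computed, via trapezoidal integration:

```python
# Sketch: quantifying one SRM transition by integrating its extracted ion
# chromatogram. All numbers below are illustrative, not real measurements.

def trapezoid_area(times, intensities):
    """Integrate an extracted ion chromatogram with the trapezoidal rule."""
    area = 0.0
    for i in range(1, len(times)):
        dt = times[i] - times[i - 1]
        area += 0.5 * (intensities[i] + intensities[i - 1]) * dt
    return area

# Hypothetical transition (e.g., precursor 523.3 m/z -> fragment 624.3 m/z)
rt = [30.0, 30.1, 30.2, 30.3, 30.4, 30.5]           # retention time, minutes
intensity = [0.0, 150.0, 900.0, 850.0, 120.0, 0.0]  # ion counts

peak_area = trapezoid_area(rt, intensity)
```

Real SRM software additionally handles baseline subtraction, peak-boundary detection, and interference checks; this sketch covers only the core integration step.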
Discovery platforms generate tandem mass spectra from selected peptides in complex mixtures. These tandem mass spectra are then matched to lists of fragment ions predicted from sequences in reference protein databases or by direct match to qualified spectral databases. Once identifications have been filtered to an acceptably stringent error rate,13 the confident set of peptides can be assembled to derive protein-level information. Tables of spectral counts for proteins observed across several samples can be used to recognize sets of proteins that vary in concentration level between cohorts of samples.
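One widely used way to filter identifications to a stringent error rate is the target-decoy strategy, in which matches to reversed or shuffled "decoy" sequences estimate the false discovery rate (FDR) among accepted matches. The sketch below is a simplified illustration of that idea, with made-up scores; production tools (e.g., those cited in this report) use more refined estimators such as q-values:

```python
# Sketch: filtering peptide-spectrum matches (PSMs) to a target FDR using a
# simplified target-decoy calculation. Scores below are illustrative only.

def filter_at_fdr(psms, fdr_threshold):
    """Keep the largest high-scoring prefix of PSMs whose estimated FDR
    (decoy count / target count) stays within the threshold; return the
    target PSMs from that prefix."""
    ranked = sorted(psms, key=lambda p: p[1], reverse=True)  # (is_decoy, score)
    accepted, targets, decoys = [], 0, 0
    for is_decoy, score in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= fdr_threshold:
            accepted = ranked[: targets + decoys]
    return [p for p in accepted if not p[0]]

# Hypothetical PSMs: (matched_a_decoy_sequence, search_score)
psms = [(False, 9.1), (False, 8.4), (True, 7.9), (False, 7.5),
        (False, 6.8), (True, 6.2), (False, 5.9), (True, 5.5)]
confident = filter_at_fdr(psms, 0.25)  # 4 target PSMs pass at 25% FDR
```

The confident peptide set produced by such filtering is what gets assembled into protein-level tables such as the spectral counts described above.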
Quality metrics can target any of several levels of data in such experiments, from the underlying instrument scans, through derived peptide and protein identifications, to the final experimental findings.
Metrics at all of these levels are relevant to quality; however, their relevance depends on the proposed application and utilization of the data. When a bioinformatics researcher applies new algorithms to instrument data, he or she may begin by replacing previous identifications with new ones. For this application, metrics are needed to confirm the quality of the underlying MS and MS/MS scans. However, a cancer researcher may wish to use a list of differentially-expressed proteins from a data set and need a more comprehensive set of metrics to evaluate the quality of the initial data analysis.
SRM mass spectrometry has recently been adopted in proteomics as a targeted method for detecting and quantifying peptides. Until recently, publications featuring SRM methods have been greatly outnumbered by those enumerating proteome inventories. However, the number of researchers and publications employing SRM-based methods continues to increase. As a result, establishing data quality metrics would help to legitimize a field that would otherwise encounter challenges similar to those faced by proteomic discovery.
In considering how to develop data quality metrics, the workshop participants framed several broad questions that the proteomics community needs to answer regarding the quality of SRM data.
The consensus among the stakeholder participants concerned the need for defined quality metrics addressing these questions. While over-standardization could prevent innovation, a lack of quality metrics and standardized annotation could cripple the comparison, portability, reproducibility, and ultimate adoption of proteomic findings and techniques.
The Sydney Workshop agreed upon the following principles for developing useful and successful quality metrics for open access proteomic data. These principles, which address the challenges outlined above, are elaborated below.
All proteomics data must be accompanied by appropriate metadata that describe the biological sample, experimental procedures, instrumentation, and any software or algorithms (including version numbers) used for post-processing. Scientific reproducibility should guide the reporting of metadata, such that recipients are able to recapitulate the analysis reported in the publication. Assessing compliance with such guidelines remains a role of journal reviewers, since they are the gatekeepers to public data in conjunction with publications.
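As a rough illustration of the kind of record this principle calls for, the sketch below assembles a minimal, machine-readable metadata document. The field names and values here are hypothetical placeholders chosen for this example, not a community standard such as MIAPE:

```python
# Sketch: a minimal metadata record covering the elements named in the
# guidelines (sample, procedures, instrument, versioned software).
# Field names and values are illustrative, not a defined standard.
import json

metadata = {
    "sample": {"organism": "Homo sapiens", "tissue": "plasma"},
    "procedure": {"digestion": "trypsin", "separation": "LC, C18 column"},
    "instrument": {"vendor": "ExampleVendor", "model": "ExampleModel"},
    "software": [
        {"name": "ExampleSearchEngine", "version": "2.1.0",
         "parameters": {"enzyme": "trypsin", "precursor_tol_ppm": 10}},
    ],
}

record = json.dumps(metadata, indent=2)  # serialized for deposition alongside data
```

Recording software versions and search parameters explicitly, as above, is what allows a recipient to attempt to recapitulate the reported analysis.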
Preventing metadata requirements from becoming too onerous will require the continued adoption of standard data formats by instrument manufacturers, data repositories, and journals. In addition to mzML7b,25 and TraML,24 participants at the workshop called for an additional format to report analysis of chromatographic data. HUPO PSI, the HUPO New Technologies Committee, and other domain-specific working groups have the charge to continue to develop and improve such formats.
Standardizing the content and presentation of metadata will require a sustainable repository for proteomics. It may be necessary to establish a specialized data repository, perhaps a variant of Tranche,9,26 PRIDE,27 GPMdb,28 or PeptideAtlas29 for quantitative data produced by SRM experiments. The need for metadata is even greater for targeted experiments since peptide sequences cannot be inferred from limited numbers of transitions. Considerations for housing such a repository should take into account an organization’s experience in developing and promoting standards for databases, data deposition, and exchange, as well as its funding model for sustainable maintenance of biomedical databases.
Quality metrics should provide added value to both generators and users of proteomic data. For investigators, quality metrics can protect against publishing mistakes and provide validation of experimental procedures that could prove useful in seeking follow-on grants. Additionally, searchable Digital Object Identifiers (DOI) should be associated with data sets to allow for stable references over time. These links will allow investigators to reference data collections with high specificity in research papers and proposals. High quality data may be featured by the community, as in the GPMdb “Data set of the week,”30 or by citations in papers, creating an incentive for authors to share data. Documentation of data use would provide a “value metric” for investigators, funding agencies, and data repositories.
For journal editors, metrics can enhance the quality and reliability of the manuscript review process. For repository managers, quality metrics can increase the value of stored data for use in meta-analyses and other retrospective data mining operations. Investigators in large, multi-site studies would have added confidence that data generated at one site can and should be aggregated and compared with data generated by other sites. As noted at the Amsterdam Summit, all parties producing, using, and funding proteomic data have an inherent responsibility to cooperate in ensuring both open access and high quality data.8
Reference materials and reference data are necessary for benchmarking and comparing methods in proteomics. These methods include both physical measurement platforms and analysis software. Besides providing a benchmarking process for method development, the availability of reference materials and data quality software would aid the interpretation of biological measurements by helping to answer the question, “Are the results due to noise in the measurement or sample preparation processes or due to real biological signal?”
Reference materials and software metrics have already played a role in standardizing LC-MS systems in the field of shotgun proteomics, including the UPS1 defined protein mix developed by Sigma through the 2006 sPRG study from ABRF and a yeast lysate developed by NCI’s Clinical Proteomic Technology Assessment for Cancer and NIST, with an accompanying data set.18,31 Other groups of researchers have developed workflows for greater reproducibility in SRM-MS experiments.32 Another lab has begun to develop the SRM Atlas, which contains highly qualified spectral libraries of proteotypic peptide data for numerous organisms.33 In terms of reference data sets, the ISB18 dataset34 and the Aurum data set for MALDI-TOF/TOF35 were both created for the purpose of benchmarking new software tools. The ABRF Proteome Informatics Research Group (iPRG) has released two data sets specifically designed for benchmarking subtractive analysis and phosphopeptide identification.36 The HUPO Plasma Proteome Project released large volumes of data that demonstrated performance for a wide variety of instruments and protocols.37 To provide a means of comparison that exceeds a basic metric such as the number of peptides identified, software that provides metrics of the measurement process, such as NISTMSQC,18 must be accessible. Other resources are available that identify low-quality peptide identifications, providing a quantitative assessment of data quality, such as Census,38 Colander,39 Debunker,40 DTASelect 2,14 IDPicker,41 PeptideProphet,42 Percolator,13a,43 and ProteinProphet.44
The workshop did not call for adoption of a standard method or platform for every experiment but rather recognized the need for formal comparison of methods on equal footing. With no shortage of reference materials and tools for evaluation of data quality, it falls to journal editors and reviewers to ensure that published data were produced using methods assessed by appropriate reference materials and data.
The proteomics community needs to improve education and training in order to improve the quality of published methods and data. The field should develop tutorials at scientific meetings, webinars, and vendor-sponsored hands-on workshops for students and postdoctoral fellows to ensure that instruments are used correctly. These tutorials might address topics such as “LOD and LOQ determination for SRM,” “A journal’s (or repository’s) guide on how to properly release a proteomic data set,” “Protecting against false positives in PTM identification,” and “Quality control in proteomics.” Senior academic investigators in the field should establish proteomics courses at their institutions. Journals likewise can present timely tutorials and have a critical educational role through enforcement of their guidelines. The American Society for Mass Spectrometry (ASMS), the Association of Biomolecular Resource Facilities (ABRF), and the U.S. Human Proteome Organization (US HUPO) already present a number of excellent tutorials in conjunction with their meetings. In particular, ABRF Research Groups regularly conduct community-wide experiments to facilitate method comparison.31b The Sydney Workshop called for advancing this effort by formalizing method comparison and publishing the results.
Additionally, proteome informatics needs a community resource (such as a wiki) in which software developers and expert users can communicate standard workflows for optimal performance. Open public discussions on strategies for selecting a FASTA database for database search or choosing a “semi-tryptic” versus a “fully-tryptic” search will help reduce the competing factions within this field. Conducting comparisons of old and new revisions of software or comparing among different tools will help to characterize inevitable software change.
In summary, the Sydney Workshop identified challenges and principles in the areas of metadata and data formats, incentives for data producers and users, reference materials and reference data, quality metrics, education and training, and sustainable data repositories.
Addressing and implementing these will require the combined efforts of journal editors, funding agencies, trade associations, instrument manufacturers, software developers, and the community of researchers. With regard to metadata, a group of workshop participants agreed to begin drafting guidelines for SRM experiments. This group, along with the HUPO PSI and the HUPO New Technologies Committee, is working on a draft set of guidelines for evaluation by the community. The HUPO PSI is also developing the XML-based mzQuantML format, which is intended to capture the metadata and results of quantitative proteomics assays, including SRM assays. There is no shortage of reference materials; however, journals, funding agencies, and investigators have a shared responsibility to ensure their routine usage and comparison. Use of reference materials may only increase as the field advances toward commercial applications subject to regulatory agencies, such as the FDA. The quality metrics themselves will be developed in individual labs, but software developers and instrument manufacturers who incorporate them into their products will offer a competitive advantage to the researchers who use them. Additionally, journals have a role in continuing to develop and enforce policies defining data quality for the data in manuscripts submitted for publication. As such, journal reviewers and editorial boards are the ultimate gatekeepers of data quality. Education about quality metrics will be fostered by journals, trade associations, and academic institutions as they promote proper usage of quality metrics. Finally, for funding agencies to support a proteomics data repository, all members of the proteomics community must convey the value of proteomics research to key stakeholders who would benefit from funding such a repository.
Over the past decade, the research community has made remarkable progress in developing the technologies and experimental protocols needed to make proteomics a viable science. The technologies are now poised to build on the successes of the genomics community in furthering our understanding of diseases at the molecular level. Like the genomics field before it, the proteomics field has now established sensible guidelines, increasingly adopted by its practitioners, which encourage easy public access to data. The Sydney Workshop provided questions and frameworks to develop the necessary metrics and conditions to ensure the quality of proteomics data. The successful implementation of quality metrics will require processes to protect against errors in data interpretation, incorporate “complete” metadata with each data set, and reward the teams that deposit data. Much of the burden for data quality will fall on bioinformatics researchers tasked with the development and adoption of long-term and scalable data storage solutions, metric formulation, and comparative testing strategies.
Key to the development of these data quality metrics is an understanding that over-standardization of the still evolving field of mass-spectrometry-based proteomics could hinder innovation and potential discovery. However, it is clear that basic standardization and the application of quality controls will be necessary to ensure reproducibility, reliability, and reuse of data. It is now imperative for the field to develop the metrics and methodologies to ensure that data are of the highest quality without also impeding research and innovation.
While challenges remain in defining the policies and metrics needed to assess data quality, the Sydney Workshop reaffirmed that the proteomics community has a clear interest in addressing these questions among funding agencies, journals, standards working groups, international societies, data repositories, and above all the research community. The Sydney Workshop represents an initial step toward the inclusion of more detailed metadata, the adoption of reference materials and data, the development of data quality metrics for method comparison and quality control, the education of users in the areas of proteomics and bioinformatics, and the recognition of data depositors.
*As of this writing, this requirement has been temporarily suspended due to technical difficulties associated with the presently available repositories. (http://www.mcponline.org/site/home/news/index.xhtml#rawdata).