J Proteome Res. Author manuscript; available in PMC Feb 3, 2013.
PMCID: PMC3272102; NIHMSID: NIHMS343225
Recommendations for Mass Spectrometry Data Quality Metrics for Open Access Data (Corollary to the Amsterdam Principles)
Christopher R. Kinsinger,* James Apffel, Mark Baker, Xiaopeng Bian, Christoph H. Borchers, Ralph Bradshaw, Mi-Youn Brusniak, Daniel W. Chan, Eric W. Deutsch, Bruno Domon, Jeff Gorman, Rudolf Grimm, William Hancock, Henning Hermjakob, David Horn, Christie Hunter, Patrik Kolar, Hans-Joachim Kraus, Hanno Langen, Rune Linding, Robert L. Moritz, Gilbert S. Omenn, Ron Orlando, Akhilesh Pandey, Peipei Ping, Amir Rahbar, Robert Rivers, Sean L. Seymour, Richard J. Simpson, Douglas Slotta, Richard D. Smith, Stephen E. Stein, David L. Tabb, Danilo Tagle, John R. Yates, III, and Henry Rodriguez
Office of Cancer Clinical Proteomics Research, National Cancer Institute, National Institutes of Health, Bethesda, MD 20892; Agilent Research Laboratories, Santa Clara, CA 95051; Department of Chemistry & Biomolecular Sciences, Macquarie University, Sydney, Australia NSW 2109; Center for Bioinformatics and Information Technology, National Cancer Institute, National Institutes of Health, Bethesda, MD 20892; Genome BC Proteomics Centre, University of Victoria, Victoria BC, Canada V8Z 7X8; Mass Spectrometry Facility, University of California, San Francisco, CA 94143; Institute of Systems Biology, Seattle, WA 98103; Department of Pathology, Johns Hopkins University School of Medicine, Baltimore, MD 21231; Institute for Systems Biology, Seattle, WA 98109; Luxembourg Clinical Proteomics Center, CRP-Sante, Strassen, Luxembourg, L-1445; Protein Discovery Centre, Queensland Institute of Medical Research, Herston, Queensland, Australia 4029; Agilent Technologies, Santa Clara, CA 95051; Department of Chemistry & Chemical Biology, Northeastern University, Boston, MA 02115; Proteomics Services, European Bioinformatics Institute, Cambridge, UK CB10 1SD; Proteomics Software Strategic Marketing, Thermo Fisher Scientific, San Jose, CA 95134; AB SCIEX, Foster City, CA 94404; Directorate-General for Research, European Commission, Brussels, Belgium, B-1049; Wiley-VCH, Weinheim, Germany, D-69469; Exploratory Biomarkers, Hoffmann-La Roche, Basel, Switzerland, 4070; The Technical University of Denmark (DTU), Cellular Signal Integration Group (C-SIG), Center for Biological Sequence Analysis (CBS), Department of Systems Biology, DK-2800 Lyngby, Denmark; Cellular & Molecular Logic Unit, Institute of Systems Biology, Seattle, WA 98103; Center for Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109; Complex Carbohydrate Research Center, University of Georgia, Athens, GA 30602; McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University, Baltimore, MD 21231; David Geffen School of Medicine, University of California, Los Angeles, CA 90095; Small Business Development Center, National Cancer Institute, National Institutes of Health, Bethesda, MD 20892; Office of Cancer Clinical Proteomics Research, National Cancer Institute, National Institutes of Health, Bethesda, MD 20892; AB Sciex, Foster City, CA 94404; La Trobe Institute for Molecular Science, La Trobe University, Bundoora, Victoria 3086, Australia; National Center for Biotechnology Information, National Institutes of Health, Bethesda, MD 20892; Pacific Northwest National Laboratory, Richland, WA 99352; Chemical Reference Data Group, National Institute of Standards and Technology, Gaithersburg, MD 20899; Vanderbilt-Ingram Cancer Center, Nashville, TN 37232; National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD 20892; The Scripps Research Institute, La Jolla, CA 92037; Office of Cancer Clinical Proteomics Research, National Cancer Institute, National Institutes of Health, Bethesda, MD 20892
* To whom correspondence should be addressed: Christopher R. Kinsinger, Ph.D., Office of Cancer Clinical Proteomics Research, National Cancer Institute, National Institutes of Health, 31 Center Drive, MSC 2580, Bethesda, Maryland 20892. kinsingc@mail.nih.gov
Policies supporting the rapid and open sharing of proteomic data are being implemented by the leading journals in the field. The proteomics community is taking steps to ensure that data are made publicly accessible and are of high quality, a challenging task that requires the development and deployment of methods for measuring and documenting data quality metrics. On September 18, 2010, the U.S. National Cancer Institute (NCI) convened the “International Workshop on Proteomic Data Quality Metrics” in Sydney, Australia, to identify and address issues facing the development and use of such methods for open access proteomics data. The stakeholders at the workshop enumerated the key principles underlying a framework for data quality assessment in mass spectrometry data that will meet the needs of the research community, journals, funding agencies, and data repositories. Attendees discussed and agreed upon two primary needs for the wide use of quality metrics: (1) an evolving list of comprehensive quality metrics and (2) standards accompanied by software analytics. Attendees stressed the importance of increased education and training programs to promote reliable protocols in proteomics. This workshop report explores the historic precedents, key discussions, and necessary next steps to enhance the quality of open access data.
By agreement, this article is published simultaneously in the Journal of Proteome Research, Molecular and Cellular Proteomics, Proteomics, and Proteomics Clinical Applications as a public service to the research community. The peer review process was a coordinated effort conducted by a panel of referees selected by the journals.
Keywords: selected reaction monitoring, bioinformatics, data quality, metrics, open access, Amsterdam Principles, standards
On September 18, 2010, members of the international proteomics community met for a one-day workshop in Sydney, Australia, that was convened by the National Cancer Institute (NCI) of the U.S. National Institutes of Health (NIH). This workshop was held to address the lack of widely implementable policies governing the adoption and use of quality metrics for open access proteomic data, particularly concerning tandem mass spectrometry data used to identify and quantify proteins based on fragmentation of peptide ions. Parallel efforts can be implemented for protein capture-based data.
Clear data policy has aided the advance of other data-driven fields of research. Data release policies for DNA sequencing have enabled widespread access to data from the Human Genome Project and other large-scale sequencing efforts.1 The development of Minimum Information About a Microarray Experiment (MIAME) defined reporting standards for microarray experiments.2 Researchers, funding agencies, and journals would all benefit from data policy clarification in proteomics as well.
This workshop was intended to further a process that, for proteomics, began in 2004. Editors of the journal Molecular and Cellular Proteomics (MCP) recognized the need for manuscript submission guidelines to help reviewers and readers understand the methods by which authors acquired, processed, and analyzed their data.3 This was followed by an expanded set of guidelines developed by an invited group of participants that met in Paris, France, in 2005.4 These guidelines delineate elements required in manuscripts detailing LC-MS/MS proteomic inventories (such as assignment of peptides, assessment of false positive rate, and inference of proteins) and other types of experiments (such as quantitative measurements, post-translational modification assignments, and peptide-mass fingerprinting). Soon after the journal MCP adopted these “Paris Guidelines” as its standard, other journals began to adopt variants of these guidelines.5 A 2009 Philadelphia workshop revisited these guidelines to require, for the first time, deposition of raw data in a public repository.* In 2008, MCP developed the first set of guidelines for clinical proteomics papers to establish a baseline of credibility for articles in this field.6 Concomitantly, the Human Proteome Organization’s Proteomics Standards Initiative (HUPO-PSI) developed guidance for reporting data and metadata (MIAPE) and sought to standardize data formats for mass spectrometry data (mzML).7 These efforts provide a solid foundation for advancing proteomics data policy.
To complement the activities of MCP and HUPO-PSI, the NCI sponsored a 2008 International Summit at which the proteomics community produced the six Amsterdam Principles to promote open access to proteomic data,8 addressing issues of timing, comprehensiveness, format, deposition to repositories, quality metrics, and responsibility for proteomic data release. MCP has defined policies for its published articles for the principles involving timing, deposition to repositories, and responsibility for proteomic data release. MIAPE and mzML provide solid frameworks for advancing the principles of comprehensiveness and format. The principle for quality metrics, however, has not yet been matched with community-derived, comprehensive guidance. The Amsterdam Principles state that “central repositories should develop threshold metrics for assessing data quality. These metrics should be developed in a coordinated manner, both with the research community and among each other, so as to ensure interoperability. As data become shared through such repositories, their value will become obvious to the community, and momentum will grow to sustain data release.”
The NCI invited key thought-leaders and stakeholders in proteomics to the Sydney Workshop, including data producers and users, managers of database repositories, editors of scientific journals, and representatives of funding agencies. These stakeholders were asked, “How good is good enough, when it comes to MS-based proteomic data?” The answers to this question would form the quality criteria for shared data, benefiting both producers and users of proteomic data.
The workshop focused on three use cases: 1) users of public proteomic data, 2) reviewers of journal articles, and 3) multi-site, data production projects in which unpublished data are shared among laboratories. In the first case, public proteomic data may be used for a variety of applications, covering a range of prerequisites for data quality and metadata. What level of annotation and which quality metrics would help a potential user quickly determine if a data set is appropriate for the intended application? In the second case, reviewers need to determine the quality of data that accompany a submitted manuscript as a means to ascertain the strength by which the data support experimental findings in the manuscript. Apart from repeating the full analysis of the data, what set of metrics would facilitate the task of reviewers? Third, team science in the context of systems biology presents opportunities and challenges for integrating data sets generated by different labs with distinct quality assurance protocols, often on different analysis platforms. Participants in the workshop stated that standards and metrics must not be oppressive, stifling innovation and methods development and potentially hindering the evolution of proteomic technologies. However, for data produced to answer a biological question rather than to demonstrate technology, what metrics would bolster confidence that data are of sufficient quality to be integrated with experimental results spanning multiple labs and platforms?
Data from proteomics experiments can assume a variety of forms, depending on the experimental design. Evaluating data set quality is heavily dependent upon accompanying annotation. Metadata relate the experimental design to produced data, making clear which files or records are associated with a particular sample. Protocols may be documented via Standard Operating Procedures, but most data sets are presently dependent on methods explained briefly or by reference in associated publications. Instrument data are generally considered the core of public data sets, but the format in which they are published may have a considerable impact on their usability. Derived information from these data may include raw peptide identifications or transition peak areas. Often this level is bypassed, with a more heavily processed protein spectral count table or protein ratio table presented with minimal explanation of the software tools used to move from instrument data to derived information. Finally, experimental findings reflect the high-level analysis motivated by the experimental design. A list of differentially-expressed proteins, enriched networks, or biomarker candidates connects to derived information through higher-level analyses conducted in statistical or systems biology tools.
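The annotation described here can be captured in a simple machine-readable structure. The Python sketch below shows one hypothetical layout (all field names and example values are illustrative, not a community standard) linking samples to their raw files and recording the software versions used for downstream processing:

```python
from dataclasses import dataclass, field

@dataclass
class SampleRecord:
    """Ties one biological sample to its raw data and protocol reference."""
    sample_id: str       # replicate or patient identifier
    condition: str       # cohort or treatment arm from the experimental design
    raw_files: list      # instrument files acquired for this sample
    protocol_ref: str    # pointer to an SOP or methods section

@dataclass
class DatasetAnnotation:
    """Minimal metadata relating experimental design to produced data."""
    dataset_id: str
    instrument: str
    software: dict       # tool name -> version used to derive downstream tables
    samples: list = field(default_factory=list)

annotation = DatasetAnnotation(
    dataset_id="PXD000001",                      # hypothetical accession
    instrument="LTQ-class ion trap (example)",   # illustrative only
    software={"search_engine": "EngineX 2.1", "quant_tool": "QuantY 0.9"},
    samples=[SampleRecord("S1", "tumor", ["S1_run1.raw"], "SOP-012")],
)
```

A record of this kind makes it unambiguous which files belong to which sample and which tool versions produced any derived tables.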
This meeting report discusses the basic principles underlying data quality metrics in proteomics and the challenges to develop and implement these principles. Its purpose is to serve as a focus of discussion and action that will guide the continuing commitment of the field to conduct high quality research and to make the product of that research useful to the global research community.
Data access and annotation
One of the major conclusions from the Amsterdam Summit was that data sharing among members of the proteomics community should be required. While there is widespread consensus on the importance of making proteomics data publicly available, the full infrastructure to do so does not yet exist. ProteomeCommons/Tranche9 and ProteomeXchange10 have been developed to meet key needs, but do not yet constitute the comprehensive system required to fully support the field. Participants at the Sydney Workshop called for a sustainably-funded repository or set of linked repositories that would work closely with journals to ensure long-term access to proteomics data.
A major problem confronting the reuse of published data files is the lack of information concerning the experiment that led to data generation. Even when experimental annotation is available, it is often cryptic or missing key information, inviting errors in mapping data to results. Contacting authors is often necessary to seek clarification.
Given public access and annotation for a data set, users should next be able to assess its quality, preferably prior to download. Data quality has become more important due to rapid evolution of technologies as well as the growing size and quantity of repository data sets. Given the option of several potential data sets, users should be able to compare among them on a common basis and evaluate their comparability or compatibility.
With major repositories joining forces to create ProteomeXchange10, quality assurance and quality control of public data sets will be necessary to separate what is useful from the rest. Moreover, research societies and funding agencies are placing importance on proteomic technologies and findings with the launch of the HUPO Human Proteome Project11, the NCI-sponsored Clinical Proteomic Technologies for Cancer (CPTC) Initiative,12 and other related initiatives, raising the importance of both data access and quality. Data quality standards will enable tools for tracking samples, measuring analytical variation, flagging incomplete data sets, and protecting against experimental misinterpretation.
Peptide to spectra matching
While the assembly of proteome inventories was the principal driver for technologies prior to the year 2000 and continues to be important as much more powerful instruments are deployed, the most recent decade has seen the emergence of quantitative proteomics as a complementary discipline. While both sets of technologies depend upon the dissociation of peptides into fragment ions, discovery technologies generally scan the fragments produced from sampled peptide ions while targeted strategies quantify fragment ions from specified sets of precursor ions. The ion-chromatograms of these fragment intensities comprise the raw output of selected reaction monitoring (SRM, also commonly referred to as multiple reaction monitoring or MRM) experiments just as tandem mass spectra are the raw output of sampling dominant peptide precursor ions.
Discovery platforms generate tandem mass spectra from selected peptides in complex mixtures. These tandem mass spectra are then matched to lists of fragment ions predicted from sequences in reference protein databases or by direct match to qualified spectral databases. Once identifications have been filtered to an acceptably stringent error rate,13 the confident set of peptides can be assembled to derive protein-level information. Tables of spectral counts for proteins observed across several samples can be used to recognize sets of proteins that vary in concentration level between cohorts of samples.
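To make the last point concrete, the sketch below applies a simple nonparametric test to hypothetical per-protein spectral counts from two cohorts; the counts and the choice of test are illustrative, and real analyses would also normalize counts and correct for multiple testing:

```python
from scipy.stats import mannwhitneyu

# Hypothetical spectral counts: one list per cohort, one entry per replicate.
counts = {
    "PROT_A": ([12, 15, 11, 14], [30, 28, 33, 31]),  # apparent change
    "PROT_B": ([8, 9, 10, 7], [9, 8, 11, 10]),       # apparently stable
}

for protein, (cohort1, cohort2) in counts.items():
    # Mann-Whitney U avoids assuming the counts are normally distributed.
    stat, p = mannwhitneyu(cohort1, cohort2, alternative="two-sided")
    print(f"{protein}: U={stat:.1f}, p={p:.3f}")
```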
Quality metrics can target any of several levels of data in such experiments:
  • A single peptide-spectrum match (PSM) overlays a peptide sequence on a tandem mass spectrum, showing the correspondence of sequence-derived fragments with peaks observed in the MS/MS. The most straightforward quality assessment for an identification may be based on the scores produced from a search engine.14 Posterior error probabilities, on the other hand, characterize the chance of error for a given PSM.15 Alternatively, one may evaluate the quality of a spectrum without recourse to an associated sequence.16
  • Evaluating a single LC-MS/MS analysis broadens the scope to thousands of tandem mass spectra. Evaluating error rates for the aggregate of identified peptides frequently takes the form of false discovery rates (FDR) or q-values17 (a minimal sketch of the target-decoy computation appears after this list). Assessment of the underlying LC separation may, however, de-emphasize the identifications and focus on MS signals instead.18 While the appropriate course must be determined by the intended application of the quality metrics, certain simple metrics can almost always be expected to be useful in characterizing an LC-MS/MS experiment. These include typical mass accuracies of both precursor and fragment ions, the amount of time over which the bulk of the peptides are eluted during the LC gradient, and the typical peak width of a peptide in the time domain.
  • At the level of the overall experiment, the stability of spectral counts among technical and biological replicates is significant.19 Attempts to establish a set of “biomarker” candidate proteins require statistical models of differences,20 with each yielding a metric to characterize the degree of change observed. Experiments of relative quantitation using labeled tags, such as iTRAQ21 or SILAC, have their own requirements for quality assessment. When multiple tags are assigned to each cohort, quality measurement may correlate quantities derived from redundant tags.
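A common route to the FDRs and q-values mentioned in the list above is the target-decoy strategy: spectra are searched against both real and reversed (“decoy”) sequences, and the decoy matches estimate the error among accepted identifications. The minimal Python sketch below uses the simple #decoys/#targets estimator; published tools such as Percolator13a apply more sophisticated models:

```python
def qvalues(psms):
    """psms: list of (score, is_decoy) pairs; higher scores are better.

    The FDR at a score threshold is estimated as decoys/targets above it;
    a PSM's q-value is the lowest FDR at which it would still be accepted.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = decoys = 0
    fdrs = []
    for _, is_decoy in ranked:
        decoys += is_decoy
        targets += not is_decoy
        fdrs.append(decoys / max(targets, 1))
    # Convert the running FDR into q-values via a reverse cumulative minimum.
    q, out = 1.0, []
    for (score, _), fdr in zip(reversed(ranked), reversed(fdrs)):
        q = min(q, fdr)
        out.append((score, q))
    return list(reversed(out))

# Hypothetical scores; True marks a decoy match.
print(qvalues([(95, False), (90, False), (88, True), (80, False), (75, True)]))
```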
These are all considered relevant metrics of quality; however, their relevance depends on the proposed application and utilization of the data. When a bioinformatics researcher applies new algorithms to instrument data, he or she may begin by replacing previous identifications with new ones. For this application, metrics are needed to confirm the quality of underlying MS and MS/MS scans. However, a cancer researcher may wish to use a list of differentially-expressed proteins from a data set and need a more comprehensive set of metrics to evaluate the quality of the initial data analysis.
Selected reaction monitoring
SRM mass spectrometry has recently been adopted in proteomics as a targeted method for detecting and quantifying peptides. Until recently, publications featuring SRM methods have been greatly outnumbered by those enumerating proteome inventories. However, the number of researchers and publications employing SRM-based methods continues to increase. As a result, establishing data quality metrics would help to legitimize a field that would otherwise encounter challenges similar to those faced by proteomic discovery.
In considering how to develop data quality metrics, the workshop participants developed several broad questions that the proteomics community needs to answer regarding the quality of SRM data, including the following:
  • Are the technologies and software analytic tools available today sufficiently well-developed to compute accurate quantitative measures, along with FDRs for SRM experiments? In SRM experiments, the analyzed ion-chromatograms may have a substantial probability of measuring an ion other than the intended target. This becomes more prevalent as lower abundance proteins are targeted and the background interference is stronger. Software packages, such as mProphet22 and AuDIT23, as well as instrument-specific software developed by vendors do provide some quality-check computations, but each of these packages has limitations and few are applicable across multiple instrument vendors. Furthermore, it may be difficult to compare results generated by different software packages without some standard metrics for data quality or standard data sets by which to assess the various software packages.
  • In the absence of robust probabilities, what minimal data and metadata should be provided in order for the reported results to be objectively assessed? Should original chromatograms in an open format such as mzML7b be required for publication? Should an annotated depiction of all transitions used to make an identification be required for publication? Should attributes of the peaks, such as peak signal-to-noise and peak asymmetry, be reported? (A sketch of computing these two attributes follows this list.) A key concern here will be finding the proper balance between the need for sufficient data and metadata to reproduce a given set of experiments and the desire to see all the data generated by those experiments. Another issue is that it may be too early to set stringent standards because of the rapid rate of technology and software innovation.
  • What are the minimal guidelines on how many peptides per protein and how many transitions per peptide are needed to ensure accurate quantitation? How should data derived from a single peptide ion for the detection of a protein be treated, particularly when only a single transition is available for identification? The answer to these questions may lie with the availability of computational models that show promise in predicting how many peptides and transitions are needed to provide reliable quantitation.
  • How can standard formats, minimal information specifications, and data repositories help promote data quality, and where should these requirements be implemented? While there are standard formats for sharing SRM experiment data, such as the Proteomic Standards Initiative (PSI) mzML7b format for experimental output formats and the PSI TraML24 format for transition lists, there is still a need to develop a standard format for reporting chromatogram analysis results.
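As a rough illustration of the peak attributes raised in the second question above, the sketch below computes a signal-to-noise ratio and an asymmetry factor for a single extracted ion chromatogram; the noise model (median absolute deviation of the trace edges) and the 10%-height asymmetry definition are common conventions rather than prescribed standards:

```python
import numpy as np

def peak_quality(times, intensities):
    """Return (signal_to_noise, asymmetry) for one chromatographic peak."""
    t = np.asarray(times, dtype=float)
    y = np.asarray(intensities, dtype=float)
    apex = int(np.argmax(y))
    # Treat the first and last 10% of points as baseline for noise estimation.
    edge = max(1, len(y) // 10)
    baseline = np.concatenate([y[:edge], y[-edge:]])
    noise = 1.4826 * np.median(np.abs(baseline - np.median(baseline)))
    snr = y[apex] / max(noise, 1e-9)
    # Walk outward from the apex to the 10%-height crossing on each side.
    thresh = 0.1 * y[apex]
    left = apex
    while left > 0 and y[left] > thresh:
        left -= 1
    right = apex
    while right < len(y) - 1 and y[right] > thresh:
        right += 1
    asymmetry = (t[right] - t[apex]) / max(t[apex] - t[left], 1e-9)
    return snr, asymmetry
```

An asymmetry factor far from 1 or a low signal-to-noise ratio would flag a transition whose quantitation deserves manual review.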
Stakeholder participants reached consensus on the need for defined quality metrics that address the questions above. While over-standardization could prevent innovation, a lack of quality metrics and standardized annotation could cripple the comparison, portability, reproducibility, and ultimate adoption of proteomic findings and techniques.
The Sydney Workshop agreed upon the following principles to develop useful and successful proteomic data quality metrics for open access data. These principles, which address the challenges outlined above, include:
Metadata Inclusion
Reproducibility
All proteomics data must be accompanied by appropriate metadata that describe the biological sample, experimental procedures, instrumentation, and any software or algorithms (including version numbers) used for post-processing. Scientific reproducibility should guide the reporting of metadata, such that recipients should be able to recapitulate the analysis reported in the publication. Assessing compliance with such guidelines remains a role of journal reviewers, since they are the gatekeepers to public data in conjunction with publications.
Formats
Preventing metadata requirements from becoming too onerous will require the continued adoption of standard data formats by instrument manufacturers, data repositories, and journals. In addition to mzML7b,25 and TraML,24 participants at the workshop called for an additional format to report analysis of chromatographic data. HUPO PSI, the HUPO New Technologies Committee, and other domain-specific working groups have the charge to continue to develop and improve such formats.
Repository
Standardizing the content and presentation of metadata will require a sustainable repository for proteomics. It may be necessary to establish a specialized data repository, perhaps a variant of Tranche,9,26 PRIDE,27 GPMdb,28 or PeptideAtlas29 for quantitative data produced by SRM experiments. The need for metadata is even greater for targeted experiments, since peptide sequences cannot be inferred from limited numbers of transitions. Considerations for housing such a repository should take into account an organization’s experience in developing and promoting standards for databases, data deposition, and data exchange, as well as a sustainable funding model for the maintenance of such biomedical databases.
Recognition for Data Production
Quality metrics should provide added value to both generators and users of proteomic data. For investigators, quality metrics can protect against publishing mistakes and provide validation of experimental procedures that could prove useful in seeking follow-on grants. Additionally, searchable Digital Object Identifiers (DOIs) should be associated with data sets to allow for stable references over time. These links will allow investigators to reference data collections with high specificity in research papers and proposals. High quality data may be featured by the community, as in the GPMdb “Data set of the week,”30 or by citations in papers, creating an incentive for authors to share data. Documentation of data use would provide a “value metric” for investigators, funding agencies, and data repositories.
For journal editors, metrics can enhance the quality and reliability of the manuscript review process. For repository managers, quality metrics can increase the value of stored data for use in meta-analyses and other retrospective data mining operations. Investigators in large, multi-site studies would have added confidence that data generated at one site can and should be aggregated and compared with data generated by other sites. As noted at the Amsterdam Summit, all parties producing, using, and funding proteomic data have an inherent responsibility to cooperate in ensuring both open access and high quality data.8
Reference Materials and Reference Data for Benchmarking
Reference materials and reference data are necessary for benchmarking and comparing methods in proteomics. These methods include both physical measurement platforms and analysis software. Besides providing a benchmarking process for method development, the availability of reference materials and data quality software would aid the interpretation of biological measurements by helping to answer the question, “Are the results due to noise in the measurement or sample preparation processes or due to real biological signal?”
Reference materials and software metrics have already played a role in standardizing LC-MS systems in the field of shotgun proteomics, including the UPS1 defined protein mix developed by Sigma in conjunction with the 2006 ABRF sPRG study and a yeast lysate developed by NCI’s Clinical Proteomic Technology Assessment for Cancer and NIST, with accompanying data set.18,31 Other groups of researchers have developed workflows for greater reproducibility in SRM-MS experiments.32 Another lab has begun to develop the SRM Atlas, which contains highly qualified spectral libraries of proteotypic peptide data for numerous organisms.33 In terms of reference data sets, the ISB18 dataset34 and the Aurum data set for MALDI-TOF/TOF35 were both created for the purpose of benchmarking new software tools. The ABRF Proteome Informatics Research Group (iPRG) has released two data sets specifically designed for benchmarking subtractive analysis and phosphopeptide identification.36 The HUPO Plasma Proteome Project released large volumes of data that demonstrated performance for a wide variety of instruments and protocols.37 To provide a means of comparison that exceeds a basic metric such as the number of peptides identified, software that provides metrics of the measurement process, such as NISTMSQC,18 must be accessible. Other resources are available that identify low-quality peptide identifications, providing a quantitative assessment of data quality, such as Census,38 Colander,39 Debunker,40 DTASelect 2,14 IDPicker,41 PeptideProphet,42 Percolator,13a,43 and ProteinProphet.44
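To make run-level metrics concrete, the sketch below computes three of the simple quantities discussed earlier (median precursor mass error, the retention-time span holding the middle 50% of identifications, and median chromatographic peak width) from a hypothetical table of identified peptides; tools such as NISTMSQC18 compute far richer metric sets:

```python
import numpy as np

def run_metrics(psms):
    """psms: dicts with 'mass_error_ppm', 'rt_min', and 'peak_fwhm_sec' keys."""
    ppm = np.array([p["mass_error_ppm"] for p in psms])
    rt = np.array([p["rt_min"] for p in psms])
    fwhm = np.array([p["peak_fwhm_sec"] for p in psms])
    q1, q3 = np.percentile(rt, [25, 75])
    return {
        "median_mass_error_ppm": float(np.median(ppm)),
        "id_rt_iqr_min": float(q3 - q1),  # gradient span holding the middle 50% of IDs
        "median_peak_fwhm_sec": float(np.median(fwhm)),
    }

example = [
    {"mass_error_ppm": 1.2, "rt_min": 24.0, "peak_fwhm_sec": 14.0},
    {"mass_error_ppm": -0.8, "rt_min": 31.5, "peak_fwhm_sec": 17.5},
    {"mass_error_ppm": 0.3, "rt_min": 42.2, "peak_fwhm_sec": 15.1},
]
print(run_metrics(example))
```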
The workshop did not call for adoption of a standard method or platform for every experiment but rather recognized the need for formal comparison of methods on equal footing. With no shortage of reference materials and tools for evaluation of data quality, it falls to journal editors and reviewers to ensure that published data were produced using methods assessed by appropriate reference materials and data.
Education and training
The proteomics community needs to improve education and training in order to improve the quality of published methods and data. The field should develop tutorials at scientific meetings, webinars, and vendor-sponsored hands-on workshops for students and postdoctoral fellows to ensure that instruments are used correctly. These tutorials might address topics such as “LOD and LOQ determination for SRM,” “A journal’s (or repository’s) guide on how to properly release a proteomic data set,” “Protecting against false positives in PTM identification,” and “Quality control in proteomics.” Senior academic investigators in the field should establish proteomics courses at their institutions. Journals likewise can present timely tutorials and have a critical educational role through enforcement of their guidelines. The American Society for Mass Spectrometry (ASMS), the Association of Biomolecular Resource Facilities (ABRF), and the U.S. Human Proteome Organization (US HUPO) already present a number of excellent tutorials in conjunction with their meetings. In particular, ABRF Research Groups regularly conduct community-wide experiments to facilitate method comparison.31b The Sydney Workshop called for advancing this effort by formalizing method comparison and publishing the results.
Additionally, proteome informatics needs a community resource (such as a wiki) in which software developers and expert users can communicate standard workflows for optimal performance. Open public discussions on strategies for selecting a FASTA database for database search or choosing a “semi-tryptic” versus a “fully-tryptic” search will help reduce the competing factions within this field. Conducting comparisons of old and new revisions of software or comparing among different tools will help to characterize inevitable software change.
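One lightweight way to characterize such change, sketched below under the assumption that each software revision emits a set of peptide identifications at a fixed FDR, is to measure the overlap between result sets; real comparisons would also examine score correlations and protein-level disagreements:

```python
def compare_revisions(ids_old, ids_new):
    """Summarize agreement between peptide sets from two software revisions."""
    old, new = set(ids_old), set(ids_new)
    union = old | new
    return {
        "shared": len(old & new),
        "only_old": len(old - new),
        "only_new": len(new - old),
        "jaccard": len(old & new) / len(union) if union else 1.0,
    }

print(compare_revisions(["PEPTIDEA", "PEPTIDEB"], ["PEPTIDEB", "PEPTIDEC"]))
```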
In summary, the Sydney Workshop identified challenges and principles in the areas summarized in Table 1.
Addressing and implementing these will require the combined efforts of journal editors, funding agencies, trade associations, instrument manufacturers, software developers, and the community of researchers. With regard to metadata, a group of workshop participants agreed to begin drafting guidelines for SRM experiments. This group, along with the HUPO PSI and the HUPO New Technologies Committee, are working together on a draft set of guidelines for evaluation by the community. The HUPO PSI is also developing the XML-based mzQuantML format that is intended to capture the metadata and results of quantitative proteomics assays, including SRM assays. There is no shortage of reference materials; however, journals, funding agencies, and investigators have a shared responsibility to ensure their routine usage and comparisons. Use of reference materials may only increase as the field advances toward commercial applications subject to regulatory agencies, such as the FDA. The quality metrics themselves will be developed in individual labs, but software developers and instrument manufacturers who incorporate them into their products will offer a competitive advantage to the researchers who use them. Additionally, journals have a role to continue to develop and enforce policies defining data quality for the data in manuscripts for publication. As such, journal reviewers and editorial boards are the ultimate gatekeepers of data quality. Education about quality metrics will be fostered by journals, trade associations, and academic institutions as they promote proper usage of quality metrics. Finally, for funding agencies to support a proteomics data repository, all members of the proteomics community must convey the value of proteomics research to key stakeholders who would benefit from funding such a repository.
Over the past decade, the research community has made remarkable progress in developing the technologies and experimental protocols needed to make proteomics a viable science. The technologies are now poised to build on the successes of the genomics community in furthering our understanding of diseases at the molecular level. Like the genomics field before it, the proteomics field has now established sensible guidelines, increasingly adopted by its practitioners, which encourage easy public access to data. The Sydney Workshop provided questions and frameworks to develop the necessary metrics and conditions to ensure the quality of proteomics data. The successful implementation of quality metrics will require processes to protect against errors in data interpretation, incorporate “complete” metadata with each data set, and reward the teams that deposit data. Much of the burden for data quality will fall on bioinformatics researchers tasked with the development and adoption of long-term and scalable data storage solutions, metric formulation, and comparative testing strategies.
Key to the development of these data quality metrics is an understanding that over-standardization of the still evolving field of mass-spectrometry-based proteomics could hinder innovation and potential discovery. However, it is clear that basic standardization and the application of quality controls will be necessary to ensure reproducibility, reliability, and reuse of data. It is now imperative for the field to develop the metrics and methodologies to ensure that data are of the highest quality without also impeding research and innovation.
While challenges remain in defining the policies and metrics needed to assess data quality, the Sydney Workshop reaffirmed that the proteomics community has a clear interest in addressing these questions among funding agencies, journals, standards working groups, international societies, data repositories, and above all the research community. The Sydney Workshop represents an initial step toward the inclusion of more detailed metadata, the adoption of reference materials and data, the development of data quality metrics for method comparison and quality control, the education of users in the areas of proteomics and bioinformatics, and the recognition of data depositors.
Table 1
Recommendations for enhancing quality metrics for proteomic data
Footnotes
*As of this writing, this requirement has been temporarily suspended due to technical difficulties associated with the presently available repositories. (http://www.mcponline.org/site/home/news/index.xhtml#rawdata).
1. (a) Policies on Release of Human Genomic Sequence Data. http://www.ornl.gov/sci/techresources/Human_Genome/research/bermuda.shtml [accessed May 11, 2011]. (b) Sharing Data from Large-scale Biological Research Projects: A System of Tripartite Responsibility. http://www.wellcome.ac.uk/stellent/groups/corporatesite/@policy_communications/documents/web_document/wtd003207.pdf [accessed May 11, 2011].
2. Brazma A, Hingamp P, Quackenbush J, Sherlock G, Spellman P, Stoeckert C, Aach J, Ansorge W, Ball CA, Causton HC, Gaasterland T, Glenisson P, Holstege FC, Kim IF, Markowitz V, Matese JC, Parkinson H, Robinson A, Sarkans U, Schulze-Kremer S, Stewart J, Taylor R, Vilo J, Vingron M. Minimum information about a microarray experiment (MIAME)-toward standards for microarray data. Nat Genet. 2001;29(4):365–71. [PubMed]
3. Carr S, Aebersold R, Baldwin M, Burlingame A, Clauser K, Nesvizhskii A. The Need for Guidelines in Publication of Peptide and Protein Identification Data: Working Group On Publication Guidelines For Peptide And Protein Identification Data. Mol Cell Proteomics. 2004;3(6):531–533. [PubMed]
4. (a) Bradshaw RA. Revised Draft Guidelines for Proteomic Data Publication. Mol Cell Proteomics. 2005;4(9):1223–1225. [PubMed](b) Bradshaw RA, Burlingame AL, Carr S, Aebersold R. Reporting Protein identification Data: The Next Generation of Guidelines. Mol Cell Proteomics. 2006;5:787–788. [PubMed]
5. Wilkins MR, Appel RD, Van Eyk JE, Chung MCM, Görg A, Hecker M, Huber LA, Langen H, Link AJ, Paik YK, Patterson SD, Pennington SR, Rabilloud T, Simpson RJ, Weiss W, Dunn MJ. Guidelines for the next 10 years of proteomics. Proteomics. 2006;6(1):4–8. [PubMed]
6. Celis JE, Carr SA, Bradshaw RA. New Guidelines for Clinical Proteomics Manuscripts. Mol Cell Proteomics. 2008;7(11):2071–2072.
7. (a) Taylor CF, Paton NW, Lilley KS, Binz PA, Julian RK, Jr, Jones AR, Zhu W, Apweiler R, Aebersold R, Deutsch EW, Dunn MJ, Heck AJ, Leitner A, Macht M, Mann M, Martens L, Neubert TA, Patterson SD, Ping P, Seymour SL, Souda P, Tsugita A, Vandekerckhove J, Vondriska TM, Whitelegge JP, Wilkins MR, Xenarios I, Yates JR, 3rd, Hermjakob H. The minimum information about a proteomics experiment (MIAPE). Nat Biotechnol. 2007;25(8):887–93. [PubMed] (b) Martens L, Chambers M, Sturm M, Kessner D, Levander F, Shofstahl J, Tang WH, Rompp A, Neumann S, Pizarro AD, Montecchi-Palazzi L, Tasman N, Coleman M, Reisinger F, Souda P, Hermjakob H, Binz PA, Deutsch EW. mzML: a community standard for mass spectrometry data. Mol Cell Proteomics. 2011;10(1):R110.000133. [PubMed]
8. Rodriguez H, Snyder M, Uhlén M, Andrews P, Beavis R, Borchers C, Chalkley RJ, Cho SY, Cottingham K, Dunn M, Dylag T, Edgar R, Hare P, Heck AJR, Hirsch RF, Kennedy K, Kolar P, Kraus H-J, Mallick P, Nesvizhskii A, Ping P, Pontén F, Yang L, Yates JR, Stein SE, Hermjakob H, Kinsinger CR, Apweiler R. Recommendations from the 2008 International Summit on Proteomics Data Release and Sharing Policy: The Amsterdam Principles. J Proteome Res. 2009;8(7):3689–3692. [PMC free article] [PubMed]
9. Hill JA, Smith BE, Papoulias PG, Andrews PC. ProteomeCommons.org collaborative annotation and project management resource integrated with the Tranche repository. J Proteome Res. 2010;9(6):2809–11. [PMC free article] [PubMed]
10. Hermjakob H, Apweiler R. The Proteomics Identifications Database (PRIDE) and the ProteomExchange Consortium: making proteomics data accessible. Expert Rev Proteomics. 2006;3(1):1–3. [PubMed]
11. Legrain P, Aebersold R, Archakov A, Bairoch A, Bala K, Beretta L, Bergeron J, Borchers C, Corthals GL, Costello CE, Deutsch EW, Domon B, Hancock W, He F, Hochstrasser D, Marko-Varga G, Salekdeh GH, Sechi S, Snyder M, Srivastava S, Uhlen M, Hu CH, Yamamoto T, Paik Y-K, Omenn GS. The human proteome project: Current state and future direction. Mol Cell Proteomics. 2011 [PubMed]
12. Boja E, Hiltke T, Rivers R, Kinsinger C, Rahbar A, Mesri M, Rodriguez H. Evolution of clinical proteomics and its role in medicine. J Proteome Res. 2011;10(1):66–84. [PubMed]
13. (a) Kall L, Canterbury JD, Weston J, Noble WS, MacCoss MJ. Semi-supervised learning for peptide identification from shotgun proteomics datasets. Nat Meth. 2007;4(11):923–925. [PubMed](b) Farrah T, Deutsch EW, Omenn GS, Campbell DS, Sun Z, Bletz JA, Mallick P, Katz JE, Malmstrom J, Ossola R, Watts JD, Lin B, Zhang H, Moritz RL, Aebersold RH. A high-confidence human plasma proteome reference set with estimated concentrations in PeptideAtlas. Mol Cell Proteomics. 2011 [PubMed]
14. Tabb DL, McDonald WH, Yates JR., 3rd DTASelect and Contrast: tools for assembling and comparing protein identifications from shotgun proteomics. J Proteome Res. 2002;1(1):21–6. [PMC free article] [PubMed]
15. Keller A, Nesvizhskii AI, Kolker E, Aebersold R. Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Anal Chem. 2002;74(20):5383–92. [PubMed]
16. Nesvizhskii AI, Roos FF, Grossmann J, Vogelzang M, Eddes JS, Gruissem W, Baginsky S, Aebersold R. Dynamic spectrum quality assessment and iterative computational analysis of shotgun proteomic data: toward more efficient identification of post-translational modifications, sequence polymorphisms, and novel peptides. Mol Cell Proteomics. 2006;5(4):652–70. [PubMed]
17. Kall L, Storey JD, MacCoss MJ, Noble WS. Assigning significance to peptides identified by tandem mass spectrometry using decoy databases. J Proteome Res. 2008;7(1):29–34. [PubMed]
18. Rudnick PA, Clauser KR, Kilpatrick LE, Tchekhovskoi DV, Neta P, Blonder N, Billheimer DD, Blackman RK, Bunk DM, Cardasis HL, Ham AJ, Jaffe JD, Kinsinger CR, Mesri M, Neubert TA, Schilling B, Tabb DL, Tegeler TJ, Vega-Montoto L, Variyath AM, Wang M, Wang P, Whiteaker JR, Zimmerman LJ, Carr SA, Fisher SJ, Gibson BW, Paulovich AG, Regnier FE, Rodriguez H, Spiegelman C, Tempst P, Liebler DC, Stein SE. Performance metrics for liquid chromatography-tandem mass spectrometry systems in proteomics analyses. Mol Cell Proteomics. 2010;9(2):225–41. [PMC free article] [PubMed]
19. Tabb DL, Vega-Montoto L, Rudnick PA, Variyath AM, Ham AJ, Bunk DM, Kilpatrick LE, Billheimer DD, Blackman RK, Cardasis HL, Carr SA, Clauser KR, Jaffe JD, Kowalski KA, Neubert TA, Regnier FE, Schilling B, Tegeler TJ, Wang M, Wang P, Whiteaker JR, Zimmerman LJ, Fisher SJ, Gibson BW, Kinsinger CR, Mesri M, Rodriguez H, Stein SE, Tempst P, Paulovich AG, Liebler DC, Spiegelman C. Repeatability and reproducibility in proteomic identifications by liquid chromatography-tandem mass spectrometry. J Proteome Res. 2010;9(2):761–76. [PMC free article] [PubMed]
20. Whiteaker JR, Zhang H, Zhao L, Wang P, Kelly-Spratt KS, Ivey RG, Piening BD, Feng LC, Kasarda E, Gurley KE, Eng JK, Chodosh LA, Kemp CJ, McIntosh MW, Paulovich AG. Integrated pipeline for mass spectrometry-based discovery and confirmation of biomarkers demonstrated in a mouse model of breast cancer. J Proteome Res. 2007;6(10):3962–75. [PubMed]
21. (a) Ross PL, Huang YN, Marchese JN, Williamson B, Parker K, Hattan S, Khainovski N, Pillai S, Dey S, Daniels S, Purkayastha S, Juhasz P, Martin S, Bartlet-Jones M, He F, Jacobson A, Pappin DJ. Multiplexed Protein Quantitation in Saccharomyces cerevisiae Using Amine-reactive Isobaric Tagging Reagents. Mol Cell Proteomics. 2004;3(12):1154–1169. [PubMed](b) Liu H, Sadygov RG, Yates JR. A Model for Random Sampling and Estimation of Relative Protein Abundance in Shotgun Proteomics. Anal Chem. 2004;76(14):4193–4201. [PubMed]
22. Reiter L, Rinner O, Picotti P, Huttenhain R, Beck M, Brusniak MY, Hengartner MO, Aebersold R. mProphet: automated data processing and statistical validation for large-scale SRM experiments. Nat Meth. 2011;8(5):430–435. [PubMed]
23. Abbatiello SE, Mani DR, Keshishian H, Carr SA. Automated Detection of Inaccurate and Imprecise Transitions in Peptide Quantification by Multiple Reaction Monitoring Mass Spectrometry. Clinical Chemistry. 2009;56(2):291–305. [PMC free article] [PubMed]
24. Orchard S, Albar JP, Deutsch EW, Eisenacher M, Binz PA, Hermjakob H. Implementing data standards: a report on the HUPO-PSI workshop, September 2009, Toronto, Canada. Proteomics. 2010;10(10):1895–8. [PubMed]
25. Deutsch E. mzML: A single, unifying data format for mass spectrometer output. Proteomics. 2008;8(14):2776–2777. [PubMed]
26. (a) Falkner JA, Andrews PC. Tranche: Secure Decentralized Data Storage for the Proteomics Community. Journal of Biomolecular Techniques. 2007;18(1):3.(b) Falkner JA, Hill JA, Andrews PC. Proteomics FASTA archive and reference resource. Proteomics. 2008;8(9):1756–7. [PubMed]
27. Vizcaino JA, Cote R, Reisinger F, Barsnes H, Foster JM, Rameseder J, Hermjakob H, Martens L. The Proteomics Identifications database: 2010 update. Nucleic Acids Res. 2010;38(Database issue):D736–42. [PMC free article] [PubMed]
28. Craig R, Cortens JP, Beavis RC. Open Source System for Analyzing, Validating, and Storing Protein Identification Data. J Proteome Res. 2004;3(6):1234–1242. [PubMed]
29. Deutsch EW, Lam H, Aebersold R. PeptideAtlas: a resource for target selection for emerging targeted proteomics workflows. EMBO Rep. 2008;9(5):429–434. [PubMed]
30. Beavis R. [accessed June 01, 2011];The GPM Data Set of the Week. 2011 http://www.thegpm.org/dsotw_2011.html.
31. (a) Paulovich AG, Billheimer D, Ham AJ, Vega-Montoto L, Rudnick PA, Tabb DL, Wang P, Blackman RK, Bunk DM, Cardasis HL, Clauser KR, Kinsinger CR, Schilling B, Tegeler TJ, Variyath AM, Wang M, Whiteaker JR, Zimmerman LJ, Fenyo D, Carr SA, Fisher SJ, Gibson BW, Mesri M, Neubert TA, Regnier FE, Rodriguez H, Spiegelman C, Stein SE, Tempst P, Liebler DC. Interlaboratory study characterizing a yeast performance standard for benchmarking LC-MS platform performance. Mol Cell Proteomics. 2010;9(2):242–54. [PubMed](b) Turck CW, Falick AM, Kowalak JA, Lane WS, Lilley KS, Phinney BS, Weintraub ST, Witkowska HE, Yates NA. The Association of Biomolecular Resource Facilities Proteomics Research Group 2006 study: relative protein quantitation. Mol Cell Proteomics. 2007;6(8):1291–8. [PubMed]
32. Addona TA, Abbatiello SE, Schilling B, Skates SJ, Mani DR, Bunk DM, Spiegelman CH, Zimmerman LJ, Ham AJL, Keshishian H, Hall SC, Allen S, Blackman RK, Borchers CH, Buck C, Cardasis HL, Cusack MP, Dodder NG, Gibson BW, Held JM, Hiltke T, Jackson A, Johansen EB, Kinsinger CR, Li J, Mesri M, Neubert TA, Niles RK, Pulsipher TC, Ransohoff D, Rodriguez H, Rudnick PA, Smith D, Tabb DL, Tegeler TJ, Variyath AM, Vega-Montoto LJ, Wahlander Å, Waldemarson S, Wang M, Whiteaker JR, Zhao L, Anderson NL, Fisher SJ, Liebler DC, Paulovich AG, Regnier FE, Tempst P, Carr SA. Multi-site assessment of the precision and reproducibility of multiple reaction monitoring–based measurements of proteins in plasma. Nat Biotechnol. 2009;27(7):633–641. [PMC free article] [PubMed]
33. Picotti P, Rinner O, Stallmach R, Dautel F, Farrah T, Domon B, Wenschuh H, Aebersold R. High-throughput generation of selected reaction-monitoring assays for proteins and proteomes. Nat Methods. 2010;7(1):43–6. [PubMed]
34. Klimek J, Eddes JS, Hohmann L, Jackson J, Peterson A, Letarte S, Gafken PR, Katz JE, Mallick P, Lee H, Schmidt A, Ossola R, Eng JK, Aebersold R, Martin DB. The Standard Protein Mix Database: A Diverse Data Set To Assist in the Production of Improved Peptide and Protein Identification Software Tools. J Proteome Res. 2007;7(1):96–103. [PMC free article] [PubMed]
35. Falkner J, Kachman M, Veine D, Walker A, Strahler J, Andrews P. Validated MALDI-TOF/TOF mass spectra for protein standards. J Am Soc Mass Spectrom. 2007;18(5):850–855. [PubMed]
36. (a) Askenazi M, Falkner J, Kowalak JA, Lane WS, Martens L, Meyer-Arendt K, Rudnick PA, Seymour SL, Searle BC, Tabb DL. In: ABRF iPRG 2009 E. coli subtractive analysis. Tabb DL, editor. Proteome Commons; Nashville, TN: 2009. (b) Askenazi M, Clauser KR, Martens L, McDonald WH, Rudnick PA, Meyer-Arendt K, Searle BC, Lane WS, Kowalak JA, Deutsch EW, Bandeira N, Chalkley RJ. In: Study Materials for Phosphopeptide Identification. Clauser K, editor. Proteome Commons; 2010.
37. Omenn GS. The Human Proteome Organization Plasma Proteome Project pilot phase: Reference specimens, technology platform comparisons, and standardized data submissions and analyses. Proteomics. 2004;4(5):1235–1240. [PubMed]
38. Park SK, Yates JR, 3rd. Census for proteome quantification. Curr Protoc Bioinformatics. 2010;Chapter 13:Unit 13.12.1–11. [PubMed]
39. Lu B, Ruse CI, Yates JR., 3rd Colander: a probability-based support vector machine algorithm for automatic screening for CID spectra of phosphopeptides prior to database search. J Proteome Res. 2008;7(8):3628–34. [PMC free article] [PubMed]
40. Lu B, Ruse C, Xu T, Park SK, Yates J., 3rd Automatic validation of phosphopeptide identifications from tandem mass spectra. Anal Chem. 2007;79(4):1301–10. [PMC free article] [PubMed]
41. Ma ZQ, Dasari S, Chambers MC, Litton MD, Sobecki SM, Zimmerman LJ, Halvey PJ, Schilling B, Drake PM, Gibson BW, Tabb DL. IDPicker 2.0: Improved protein assembly with high discrimination peptide identification filtering. J Proteome Res. 2009;8(8):3872–81. [PMC free article] [PubMed]
42. Keller A, Nesvizhskii A, Kolker E, Aebersold R. Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Anal Chem. 2002;74(20):5383. [PubMed]
43. Klammer AA, Park CY, Noble WS. Statistical Calibration of the SEQUEST XCorr Function. J Proteome Res. 2009;8(4):2106–2113. [PMC free article] [PubMed]
44. Nesvizhskii A, Keller A, Kolker E, Aebersold R. A statistical model for identifying proteins by tandem mass spectrometry. Anal Chem. 2003;75(17):4646. [PubMed]