Every data-rich community research effort requires a clear plan for ensuring the quality of the data interpretation and comparability of analyses. To address this need within the Human Proteome Project (HPP) of the Human Proteome Organization (HUPO), we have developed through broad consultation a set of mass spectrometry data interpretation guidelines that should be applied to all HPP data contributions. For submission of manuscripts reporting HPP protein identification results, the guidelines are presented as a one-page checklist containing fifteen essential points followed by two pages of expanded description of each. Here, we present an overview of the guidelines and provide an in-depth description of each of the fifteen elements to facilitate understanding of the intentions and rationale behind the guidelines, both for authors and for reviewers. Broadly, these guidelines provide specific directions regarding how HPP data are to be submitted to mass spectrometry data repositories, how error analysis should be presented, and how detection of novel proteins should be supported with additional confirmatory evidence. These guidelines, developed by the HPP community, are presented to the broader scientific community for further discussion.
Guidelines; standards; Human Proteome Project; mass spectrometry; false-discovery rates; alternative protein matches
The results of analysis of shotgun proteomics mass spectrometry data can be greatly affected by the selection of the reference protein sequence database against which the spectra are matched. For many species there are multiple sources from which somewhat different sequence sets can be obtained. This can lead to confusion about which database is best in which circumstances – a problem especially acute in human sample analysis. All sequence databases are genome-based, compiling sequences for the predicted genes and their protein translation products. Our goal is to create a set of primary sequence databases that comprise the union of sequences from many of the different available sources and to make the result easily available to the community. We have compiled a set of four sequence databases of varying sizes, from a small database consisting of only the ~20,000 primary isoforms plus contaminants to a very large database that includes almost all non-redundant protein sequences from several sources. This set of tiered, increasingly complete human protein sequence databases suitable for mass spectrometry proteomics sequence database searching is called the Tiered Human Integrated Search Proteome set. To evaluate the utility of these databases, we analyzed two different data sets, one from the HeLa cell line and the other from normal human liver tissue, with each of the four tiers of database complexity. Approximately 0.8%, 1.1%, and 1.5% additional peptides can be identified with the Tier 2, 3, and 4 databases, respectively, as compared with the Tier 1 database, at substantially increasing computational cost. This increase in computational cost may be worth bearing if the identification of sequence variants or the discovery of sequences that are not present in the reviewed knowledge base entries is an important goal of the study.
We find that it is useful to search a data set against a simpler database, and then check the uniqueness of the discovered peptides against a more complex database. We have set up an automated system that downloads all the source databases on the first of each month and automatically generates a new set of search databases and makes them available for download at http://www.peptideatlas.org/thisp/.
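The two-stage strategy described above can be sketched in a few lines: identify peptides against a small database, then verify their uniqueness against a larger tier. The simple FASTA parser and the substring-based mapping below are illustrative assumptions for the sketch, not part of the THISP tooling itself.

```python
# Sketch: check the uniqueness of peptides (found with a small database)
# against a larger, more complete sequence database.

def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def peptide_uniqueness(peptides, fasta_path):
    """Map each peptide to the set of protein accessions containing it."""
    hits = {p: set() for p in peptides}
    for header, seq in read_fasta(fasta_path):
        accession = header.split()[0]
        for p in peptides:
            if p in seq:
                hits[p].add(accession)
    return hits
```

A peptide whose hit set grows when checked against a larger tier is no longer uniquely mapping, which is exactly the situation this two-stage check is meant to expose.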
shotgun mass spectrometry; search databases; human
Modern biomedical data collection is generating exponentially more data in a multitude of formats. This flood of complex data presents significant opportunities to discover and understand the critical interplay among such diverse domains as genomics, proteomics, metabolomics, and phenomics, including imaging, biometrics, and clinical data. The Big Data for Discovery Science Center is taking an “-ome to home” approach to discover linkages between these disparate data sources by mining existing databases of proteomic and genomic data, brain images, and clinical assessments. In support of this work, the authors developed new technological capabilities that make it easy for researchers to manage, aggregate, manipulate, integrate, and model large amounts of distributed data. Guided by biological domain expertise, the Center’s computational resources and software will reveal relationships and patterns, aiding researchers in identifying biomarkers for the most confounding conditions and diseases, such as Parkinson’s and Alzheimer’s.
big data; biomedical data; resources; discovery science; neuroscience; analytics; BD2K; Parkinson's disease; Alzheimer's disease
To provide new and expanded proteome documentation of the opportunistic pathogen Candida albicans, we have developed new protein extraction and analysis routines to produce an extended and enhanced version of the C. albicans PeptideAtlas. Two new datasets, resulting from experiments consisting of exhaustive subcellular fractionations and different growth conditions, plus two additional datasets from previous experiments on the surface and secreted proteomes, have been incorporated to increase the coverage of the proteome. High-resolution precursor mass spectrometry (MS) and ion trap tandem MS spectra were analyzed with three different search engines using a database containing allele-specific sequences. This approach, novel for a large-scale C. albicans proteomics project, was combined with the post-processing and filtering implemented in the Trans-Proteomic Pipeline, used consistently throughout the PeptideAtlas project, and resulted in 49,372 additional peptides (a 3-fold increase) and 1,630 more proteins (a 1.6-fold increase) identified in the new C. albicans PeptideAtlas with respect to the previous build. A total of 71,310 peptides and 4,174 canonical (minimal non-redundant set) proteins (4,115 if one protein per pair of alleles is considered) were identified, representing 66% of the 6,218 proteins in the predicted proteome. This makes the new PeptideAtlas build the most comprehensive C. albicans proteomics resource available and the only large-scale one with detections of individual alleles.
The Biology- and Disease-driven Human Proteome Project (B/D-HPP) aims to support and enhance the broad use of state-of-the-art proteomic methods to characterize and quantify proteins for in-depth understanding of the molecular mechanisms of biological processes and human disease. Built on a foundation of pre-existing HUPO initiatives begun in 2002, the B/D-HPP is designed to provide standardized methods and resources for mass spectrometry and specific protein affinity reagents and to facilitate accessibility of these resources to the broader life sciences research and clinical communities. Currently there are 22 B/D-HPP initiatives and 3 closely related HPP resource pillars. The B/D-HPP groups are working to define sets of protein targets that are highly relevant to each particular field, to deliver assays for the measurement of these selected targets, and to disseminate and make publicly accessible the information and tools generated. Major developments are the 2016 publications of the Human SRM Atlas and of “popular protein sets” for six organ systems. Here we present the current activities and plans of the B/D-HPP initiatives as highlighted in numerous B/D-HPP workshops at the 14th annual HUPO 2015 World Congress of Proteomics in Vancouver, Canada.
Biology and Disease-driven Human Proteome Project; B/D-HPP; selected reaction monitoring–MS; SRM–MS; mass spectrometry; MS
The HUPO Human Proteome Project (HPP) has two overall goals: (1) stepwise completion of the protein parts list—the draft human proteome, including confident identification and characterization of at least one protein product from each protein-coding gene, with increasing emphasis on sequence variants, post-translational modifications (PTMs), and splice isoforms of those proteins; and (2) making proteomics an integrated counterpart to genomics throughout the biomedical and life sciences community. PeptideAtlas and GPMDB reanalyze all major human mass spectrometry data sets available through ProteomeXchange with standardized protocols and stringent quality filters; neXtProt curates and integrates mass spectrometry and other findings to present the most up-to-date, authoritative compendium of the human proteome. The HPP Guidelines for Mass Spectrometry Data Interpretation version 2.1 were applied to manuscripts submitted for this 2016 C-HPP-led special issue [www.thehpp.org/guidelines]. The Human Proteome presented as neXtProt version 2016-02 has 16,518 confident protein identifications (Protein Existence [PE] Level 1), up from 13,664 at 2012-12, 15,646 at 2013-09, and 16,491 at 2014-10. There are 485 proteins that would have been PE1 under the Guidelines v1.0 from 2012 but now have insufficient evidence under the agreed-upon, more stringent Guidelines v2.0 designed to reduce false positives. neXtProt and PeptideAtlas now both require two non-nested, uniquely mapping (proteotypic) peptides of at least 9 aa in length. There are 2,949 missing proteins (PE2+3+4) as the baseline for submissions for this fourth annual C-HPP special issue of the Journal of Proteome Research. PeptideAtlas has 14,629 canonical (plus 1,187 uncertain and 1,755 redundant) entries. GPMDB has 16,190 EC4 entries, and the Human Protein Atlas has 10,475 entries with supportive evidence.
neXtProt, PeptideAtlas, and GPMDB are rich resources of information about post-translational modifications (PTMs), single amino acid variants (SAAVs), and splice isoforms. Meanwhile, the Biology- and Disease-driven (B/D)-HPP has created comprehensive SRM resources, generated popular protein lists to guide targeted proteomics assays for specific diseases, and launched an Early Career Researchers initiative.
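The two-peptide criterion described above lends itself to a mechanical check. The sketch below approximates "non-nested" by substring containment of the peptide sequences, an assumption made for brevity; the actual guideline is defined over the peptides' mapped coordinates in the protein.

```python
# Sketch of the HPP Guidelines v2.x criterion: at least two uniquely
# mapping peptides, each >= 9 residues, with no peptide fully nested
# inside another (approximated here by substring containment).

def meets_hpp_rule(unique_peptides, min_len=9, min_count=2):
    candidates = [p for p in set(unique_peptides) if len(p) >= min_len]
    # Drop peptides fully contained within a longer candidate (nested).
    non_nested = [p for p in candidates
                  if not any(p != q and p in q for q in candidates)]
    return len(non_nested) >= min_count

assert meets_hpp_rule(["ELVISLIVES", "KINGOFROCKNR"])   # two distinct long peptides
assert not meets_hpp_rule(["ELVISLIVES", "VISLIVE"])    # second peptide too short
```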
metrics; guidelines; neXtProt; PeptideAtlas; GPMDB; Human Protein Atlas; PTMs (post-translational modifications); N-termini; SAAV (single amino acid variants); splice isoforms
The ProteomeXchange (PX) Consortium of proteomics resources (http://www.proteomexchange.org) was formally started in 2011 to standardize data submission and dissemination of mass spectrometry proteomics data worldwide. We give an overview of the current consortium activities and describe the advances of the past few years. Augmenting the PX founding members (PRIDE and PeptideAtlas, including the PASSEL resource), two new members have joined the consortium: MassIVE and jPOST. ProteomeCentral remains as the common data access portal, providing the ability to search for data sets in all participating PX resources, now with enhanced data visualization components.
We describe the updated submission guidelines, now expanded to include four members instead of two. As demonstrated by data submission statistics, PX is supporting a change in culture of the proteomics field: public data sharing is now an accepted standard, supported by requirements for journal submissions resulting in public data release becoming the norm. More than 4500 data sets have been submitted to the various PX resources since 2012. Human is the most represented species with approximately half of the data sets, followed by some of the main model organisms and a growing list of more than 900 diverse species. Data reprocessing activities are becoming more prominent, with both MassIVE and PeptideAtlas releasing the results of reprocessed data sets. Finally, we outline the upcoming advances for ProteomeXchange.
Objective To describe the goals of the Proteomics Standards Initiative (PSI) of the Human Proteome Organization, the methods that the PSI has employed to create data standards, the resulting output of the PSI, lessons learned from the PSI’s evolution, and future directions and synergies for the group.
Materials and Methods The PSI has 5 categories of deliverables that have guided the group. These are minimum information guidelines, data formats, controlled vocabularies, resources and software tools, and dissemination activities. These deliverables are produced via the leadership and working group organization of the initiative, driven by frequent workshops and ongoing communication within the working groups. Official standards are subjected to a rigorous document process that includes several levels of peer review prior to release.
Results We have produced and published minimum information guidelines describing what information should be provided when making data public, either via public repositories or other means. The PSI has produced a series of standard formats covering mass spectrometer input, mass spectrometer output, results of informatics analysis (both qualitative and quantitative analyses), reports of molecular interaction data, and gel electrophoresis analyses. We have produced controlled vocabularies that ensure that concepts are uniformly annotated in the formats and engaged in extensive software development and dissemination efforts so that the standards can efficiently be used by the community.
Conclusion In its first dozen years of operation, the PSI has produced many standards that have accelerated the field of proteomics by facilitating data exchange and deposition to data repositories. We look to the future to continue developing standards for new proteomics technologies and workflows and mechanisms for integration with other omics data types. Our products facilitate the translation of genomics and proteomics findings to clinical and biological phenotypes. The PSI website can be accessed at http://www.psidev.info.
standards; data standards; data formats; guidelines; proteomics; standards organization; HUPO; proteomics standards initiative
Most shotgun proteomics data analysis workflows are based on the assumption that each fragment ion spectrum is explained by a single species of peptide ion isolated by the mass spectrometer; in reality, however, mass spectrometers often isolate more than one peptide ion within the isolation window, contributing additional fragment peaks to many spectra. We present a new tool called reSpect, implemented in the Trans-Proteomic Pipeline (TPP), that enables an iterative workflow whereby fragment ion peaks explained by a peptide ion identified in one round of sequence searching or spectral library searching are attenuated based on the confidence of the identification, and the altered spectrum is then subjected to further rounds of searching. The reSpect tool is not implemented as a search engine, but rather as a post-search processing step in which only fragment ion intensities are altered. This enables the application of any combination of search engines in subsequent iterations; reSpect is thus compatible with all protein sequence database search engines as well as peptide spectral library search engines supported by the TPP. We show that while some datasets are highly amenable to chimeric spectrum identification, yielding additional peptide identification gains of over 30% with as many as four different peptide ions identified per spectrum, datasets with narrow precursor ion selection benefit from such processing only at the level of a few percent. We demonstrate a technique for estimating the degree to which a dataset would benefit from chimeric spectrum analysis. The reSpect tool is free and open source, provided within the TPP and available at the TPP website.
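The attenuation step at the heart of this iterative workflow can be sketched simply: peaks explained by an identified peptide are scaled down in proportion to the identification's confidence before the residual spectrum is searched again. The tolerance and the linear attenuation rule below are illustrative assumptions, not the exact reSpect logic.

```python
# Sketch: attenuate fragment peaks explained by an identified peptide ion,
# leaving a residual spectrum for further rounds of searching.

def attenuate(spectrum, explained_mz, probability, tol=0.02):
    """Return a new (mz, intensity) list with explained peaks attenuated.

    spectrum      -- list of (mz, intensity) pairs
    explained_mz  -- fragment m/z values matched to the identified peptide
    probability   -- confidence of the identification (0..1)
    tol           -- fragment m/z matching tolerance in Th (assumed value)
    """
    out = []
    for mz, inten in spectrum:
        if any(abs(mz - e) <= tol for e in explained_mz):
            inten *= (1.0 - probability)  # high confidence -> strong attenuation
        out.append((mz, inten))
    return out

spectrum = [(200.10, 1000.0), (300.15, 500.0), (400.20, 750.0)]
residual = attenuate(spectrum, explained_mz=[200.10, 400.21], probability=0.9)
# The peaks near 200.10 and 400.20 are reduced to 10% intensity; 300.15 is untouched.
```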
A unique archive of Big Data on Parkinson’s disease is collected, managed, and disseminated by the Parkinson’s Progression Markers Initiative (PPMI). The integration of such complex and heterogeneous Big Data from multiple sources offers unparalleled opportunities to study the early stages of prevalent neurodegenerative processes, track their progression, and quickly identify the efficacies of alternative treatments. Many previous human and animal studies have examined the relationship of Parkinson’s disease (PD) risk to trauma, genetics, environment, co-morbidities, or lifestyle. The defining characteristics of Big Data (large size, incongruency, incompleteness, complexity, multiplicity of scales, and heterogeneity of information-generating sources) all pose challenges to the classical techniques for data management, processing, visualization, and interpretation. We propose, implement, test, and validate complementary model-based and model-free approaches for PD classification and prediction. To explore PD risk using Big Data methodology, we jointly processed complex PPMI imaging, genetics, clinical, and demographic data.
Methods and Findings
Collective representation of the multi-source data facilitates the aggregation and harmonization of complex data elements. This enables joint modeling of the complete data, leading to the development of Big Data analytics, predictive synthesis, and statistical validation. Using heterogeneous PPMI data, we developed a comprehensive protocol for end-to-end data characterization, manipulation, processing, cleaning, analysis and validation. Specifically, we (i) introduce methods for rebalancing imbalanced cohorts, (ii) utilize a wide spectrum of classification methods to generate consistent and powerful phenotypic predictions, and (iii) generate reproducible machine-learning based classification that enables the reporting of model parameters and diagnostic forecasting based on new data. We evaluated several complementary model-based predictive approaches, which failed to generate accurate and reliable diagnostic predictions. However, the results of several machine-learning based classification methods indicated significant power to predict Parkinson’s disease in the PPMI subjects (consistent accuracy, sensitivity, and specificity exceeding 96%, confirmed using statistical n-fold cross-validation). Clinical (e.g., Unified Parkinson's Disease Rating Scale (UPDRS) scores), demographic (e.g., age), genetics (e.g., rs34637584, chr12), and derived neuroimaging biomarker (e.g., cerebellum shape index) data all contributed to the predictive analytics and diagnostic forecasting.
Model-free Big Data machine learning-based classification methods (e.g., adaptive boosting, support vector machines) can outperform model-based techniques in terms of predictive precision and reliability (e.g., forecasting patient diagnosis). We observed that statistical rebalancing of cohort sizes yields better discrimination of group differences, specifically for predictive analytics based on heterogeneous and incomplete PPMI data. UPDRS scores play a critical role in predicting diagnosis, which is expected based on the clinical definition of Parkinson’s disease. Even without longitudinal UPDRS data, however, the accuracy of model-free machine learning based classification is over 80%. The methods, software and protocols developed here are openly shared and can be employed to study other neurodegenerative disorders (e.g., Alzheimer’s, Huntington’s, amyotrophic lateral sclerosis), as well as for other predictive Big Data analytics applications.
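The cohort-rebalancing step highlighted above can be sketched as simple random oversampling of the minority class until class sizes match. This is an illustrative stand-in for the rebalancing methods applied to the PPMI data, not the specific procedure used in the study.

```python
# Sketch: rebalance an imbalanced cohort by oversampling minority classes.

import random

def rebalance(samples, label_of, seed=0):
    """Oversample minority classes until all classes have equal counts."""
    rng = random.Random(seed)
    by_class = {}
    for s in samples:
        by_class.setdefault(label_of(s), []).append(s)
    target = max(len(v) for v in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(members)
        # Draw extra samples (with replacement) from under-represented classes.
        balanced.extend(rng.choice(members) for _ in range(target - len(members)))
    return balanced

cohort = [("pd", i) for i in range(8)] + [("ctrl", i) for i in range(2)]
balanced = rebalance(cohort, label_of=lambda s: s[0])
# Both classes now contribute 8 samples each.
```

Rebalanced training sets of this kind are what allow classifiers to discriminate group differences without being dominated by the majority class.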
Democratization of genomics technologies has enabled the rapid determination of genotypes. More recently the democratization of comprehensive proteomics technologies is enabling the determination of the cellular phenotype and the molecular events that define its dynamic state. Core proteomic technologies include mass spectrometry to define protein sequence, protein:protein interactions, and protein post-translational modifications. Key enabling technologies for proteomics are bioinformatic pipelines to identify, quantitate, and summarize these events. The Trans-Proteomic Pipeline (TPP) is a robust open-source standardized data processing pipeline for large-scale reproducible quantitative mass spectrometry proteomics. It supports all major operating systems and instrument vendors via open data formats. Here we provide a review of the overall proteomics workflow supported by the TPP, its major tools, and how it can be used in its various modes from desktop to cloud computing. We describe new features for the TPP, including data visualization functionality. We conclude by describing some common perils that affect the analysis of tandem mass spectrometry datasets, as well as some major upcoming features.
bioinformatics; mass spectrometry
We present a novel mass spectrometry-based high-throughput workflow and an open-source computational and data resource to reproducibly identify and quantify HLA-associated peptides. Collectively, the resources support the generation of HLA allele-specific peptide assay libraries consisting of consensus fragment ion spectra, and the analysis of quantitative digital maps of HLA peptidomes generated from a range of biological sources by SWATH mass spectrometry (MS). This study represents the first community-based effort to develop a robust platform for the reproducible and quantitative measurement of the entire repertoire of peptides presented by HLA molecules, an essential step towards the design of efficient immunotherapies.
The cells of the immune system protect us by recognizing telltale molecules produced by damaged and diseased cells, or by infection-causing microorganisms (which are also called pathogens). To help with this process, the cells in our bodies display small fragments of proteins (called peptides) on their surface that are then checked by the immune cells. Collectively, these peptides are referred to as the ‘immunopeptidome’, and deciphering the complexity of the human immunopeptidome is important for both basic research and medical science. Such an achievement would help to guide the development of next-generation vaccines and therapies against autoimmune disorders, infectious diseases and cancers.
In the past, immune peptides were mostly identified using a technique that is commonly called ‘shotgun’ mass spectrometry. However, this approach doesn't always provide reproducible results. In 2012, researchers reported the development of a new approach—which they called ‘SWATH’ mass spectrometry—that could yield more reproducible data.
Now, Caron et al.—including many of the researchers involved in the 2012 study—have developed a large collection of standardized tests that use SWATH mass spectrometry to analyze the human immunopeptidome. The workflow and the computational and data resources developed as part of this international effort are the first steps toward highly reproducible and measurable analyses of the immunopeptidome across many samples. Moreover, the large repository of assays generated by the project has been made public and will serve a large community of researchers, which should enable better collaborations.
In the future, SWATH mass spectrometry could be used as a robust technology for the reproducible detection and measurement of pathogen-specific or cancer-specific immune peptides. This could greatly help in the design of personalized immune-based therapies.
human leukocyte antigen; immunopeptidome; targeted mass spectrometry; SWATH-MS; DIA; human
Biological research of Sus scrofa, the domestic pig, is of immediate relevance for food production sciences, and for developing the pig as a model organism for human biomedical research. Publicly available data repositories play a fundamental role for all biological sciences, and protein data repositories are in particular essential for the successful development of new proteomic methods. Cumulative proteome data repositories, including the PeptideAtlas, provide the means for targeted proteomics, system-wide observations, and cross-species observational studies, but pigs have so far been underrepresented in existing repositories.
We here present a significantly improved build of the Pig PeptideAtlas, which includes pig proteome data from 25 tissues and three body fluid types, mapped to 7139 canonical proteins. The content of the Pig PeptideAtlas reflects actively ongoing research within the veterinary proteomics domain, and this manuscript demonstrates how the expression of isoform-unique peptides can be observed across distinct tissues and body fluids. The Pig PeptideAtlas is a unique resource for use in animal proteome research, particularly biomarker discovery and preliminary design of SRM assays, which are equally important for progress in research that supports farm animal production and veterinary health as for developing pig models with relevance to human health.
Animal model; biomarker; Pig PeptideAtlas; proteomics; repositories
Proteome information resources of farm animals are lagging behind those of the classical model organisms despite their important biological and economic relevance. Here we present a Bovine PeptideAtlas, representing a first collection of bovine proteomics datasets within the PeptideAtlas framework. This database was built primarily as a source of information for designing selected reaction monitoring assays for studying milk production and mammary gland health, but it has an intrinsic general value for the farm animal research community. The Bovine PeptideAtlas comprises 1921 proteins at 1.2% false discovery rate (FDR) and 8559 distinct peptides at 0.29% FDR identified in 107 samples from 6 tissues. The PeptideAtlas web interface has a rich set of visualization and data exploration tools, enabling users to interactively mine information about individual proteins and peptides, their genome mappings, and supporting spectral evidence.
Bos taurus; mammary gland proteome; milk proteome; PeptideAtlas; proteotypic peptides
The Human PeptideAtlas is a compendium of the highest quality peptide identifications from over 1000 shotgun mass spectrometry proteomics experiments collected from many different labs, all reanalyzed through a uniform processing pipeline. The latest 2015-03 build contains substantially more input data than past releases, is mapped to a recent version of our merged reference proteome, and uses improved informatics processing, including the newly developed AtlasProphet, to provide the highest quality results. Within the set of ~20,000 neXtProt primary entries, 14,070 (70%) are confidently detected in the latest build, 5% are ambiguous, and 9% are redundant, leaving just 16% (3166) of proteins with no mapping detections. The build is derived from over 133 million peptide-spectrum matches identifying more than 1 million distinct peptides, with AtlasProphet used to characterize and classify the protein matches. Improved handling for the detection and presentation of single amino-acid variants (SAAVs) reveals 5,326 uniquely mapping SAAVs across 2,794 proteins. With such a large amount of data, the control of false positives is a challenge. We present the methodology and results for maintaining rigorous quality, along with a discussion of the implications of the remaining sources of error in the build. We check our uncertainty estimates against a set of olfactory receptor proteins not expected to be present in the set, and we show how the use of synthetic reference spectra can provide confirmatory evidence for claims of detection of proteins with weak evidence.
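Controlling false positives in builds of this scale commonly relies on target-decoy competition, where the number of decoy matches above a score threshold estimates the number of false target matches. The sketch below shows that generic estimate; it is not the specific AtlasProphet model, and the example scores are invented.

```python
# Sketch: target-decoy estimate of the false discovery rate among
# peptide-spectrum matches (PSMs) passing a score threshold.

def decoy_fdr(psms, threshold):
    """Estimate FDR among PSMs scoring at or above threshold.

    psms -- list of (score, is_decoy) pairs
    """
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0

psms = [(0.99, False), (0.95, False), (0.92, True), (0.90, False), (0.40, True)]
# At threshold 0.9: 3 target PSMs and 1 decoy PSM pass, so the estimated FDR is 1/3.
```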
shotgun proteomics; tandem mass spectrometry; repositories; PeptideAtlas; Human Proteome Project; observed proteome
Remarkable progress continues on the annotation of the proteins identified in the Human Proteome and on finding credible proteomic evidence for the expression of “missing proteins”. Missing proteins are those with no previous protein-level evidence or insufficient evidence to make a confident identification upon reanalysis in PeptideAtlas and curation in neXtProt. Enhanced with several major new data sets published in 2014, the human proteome presented as neXtProt, version 2014-09-19, has 16,491 unique confident proteins (PE level 1), up from 13,664 at 2012-12 and 15,646 at 2013-09. That leaves 2,948 missing proteins from genes classified as having protein existence level PE 2, 3, or 4, as well as 616 dubious proteins at PE 5. Here, we document the progress of the HPP and discuss the importance of assessing the quality of evidence, confirming automated findings, and considering alternative protein matches for spectra and peptides. We provide guidelines for proteomics investigators to apply in reporting newly identified proteins.
Human Proteome Project; HPP metrics; guidelines; high-confidence protein identifications; neXtProt; PeptideAtlas; Human Protein Atlas; Global Proteome Machine database (GPMDB); missing proteins; novel proteins
Public repositories for proteomics data have accelerated proteomics research by enabling more efficient cross-analyses of datasets, supporting the creation of protein and peptide compendia of experimental results, supporting the development and testing of new software tools, and facilitating the manuscript review process. The repositories available to date have been designed to accommodate either shotgun experiments or generic proteomic data files. Here, we describe a new kind of proteomic data repository for the collection and representation of data from selected reaction monitoring (SRM) measurements. The PeptideAtlas SRM Experiment Library (PASSEL) allows researchers to easily submit proteomic data sets generated by SRM. The raw data are automatically processed in a uniform manner and the results are stored in a database, where they may be downloaded or browsed via a web interface that includes a chromatogram viewer. PASSEL enables cross-analysis of SRM data, supports optimization of SRM data collection, and facilitates the review process of SRM data. Further, PASSEL will help in the assessment of proteotypic peptide performance in a wide array of samples containing the same peptide, as well as across multiple experimental protocols.
data repository; MRM; software; SRM; targeted proteomics
The application of mass spectrometry (MS) to the analysis of proteomes has enabled the high-throughput identification and abundance measurement of hundreds to thousands of proteins per experiment. However, the formidable informatics challenge associated with analyzing MS data has required a wide variety of data file formats to encode the complex data types associated with MS workflows. These formats encompass the encoding of input instruction for instruments, output products of the instruments, and several levels of information and results used by and produced by the informatics analysis tools. A brief overview of the most common file formats in use today is presented here, along with a discussion of related topics.
PeptideAtlas, SRMAtlas, and PASSEL are web-accessible resources supporting discovery and targeted proteomics research. PeptideAtlas is a multi-species compendium of shotgun proteomic data provided by the scientific community; SRMAtlas is a resource of high-quality, complete-proteome SRM assays generated in a consistent manner for the targeted identification and quantification of proteins; and PASSEL is a repository that compiles and represents selected reaction monitoring data, all behind an easy-to-use interface. The databases are generated from native mass spectrometry data files that are analyzed in a standardized manner, including statistical validation of the results. Each resource offers search functionality and can be queried with user-defined constraints; query results are provided in tables or displayed graphically. PeptideAtlas, SRMAtlas, and PASSEL are freely and publicly available via the website http://www.peptideatlas.org. In this protocol, we describe the use of these resources and highlight how to submit, search, collate, and download data.
discovery proteomics; targeted proteomics; selected reaction monitoring (SRM); data repository; data resource; complete proteome library
Targeted proteomics via selected reaction monitoring is a powerful mass spectrometric technique affording higher dynamic range, increased specificity and lower limits of detection than shotgun mass spectrometry methods when applied to proteome analyses. However, it involves the selective measurement of predetermined analytes, which requires more preparation in the form of selecting appropriate signatures for the proteins and peptides to be targeted. A growing number of software programs and resources exist for selecting optimal transitions and the instrument settings used for the detection and quantification of the targeted peptides, but the exchange of this information is hindered by the lack of a standard format. We have developed a new standardized format, called TraML, for encoding transition lists and associated metadata. In addition to introducing the TraML format, we demonstrate several implementations across the community, and provide semantic validators, extensive documentation, and multiple example instances of correctly written documents. Widespread use of TraML will facilitate the exchange of transitions, reduce the time spent handling incompatible list formats, increase the reusability of previously optimized transitions, and thus accelerate the broad adoption of targeted proteomics via selected reaction monitoring.
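The core content of a transition list, whatever its encoding, is a set of precursor/product m/z pairs for each targeted peptide. As a hedged sketch of where those values come from (this is not the TraML schema itself, and only a subset of residue masses is listed), the m/z values can be computed from standard monoisotopic residue masses:

```python
# Monoisotopic residue masses (Da) for a subset of amino acids.
# The values are standard, but the table is abridged for illustration.
RESIDUE = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "I": 113.08406, "L": 113.08406,
    "N": 114.04293, "D": 115.02694, "E": 129.04259, "K": 128.09496,
    "R": 156.10111, "F": 147.06841, "Y": 163.06333,
}
WATER = 18.010565   # mass of H2O added to the residue sum
PROTON = 1.007276   # mass of the proton charge carrier

def precursor_mz(peptide: str, charge: int) -> float:
    """m/z of the intact peptide at the given charge state."""
    mass = sum(RESIDUE[aa] for aa in peptide) + WATER
    return (mass + charge * PROTON) / charge

def y_ion_mz(peptide: str, n: int, charge: int = 1) -> float:
    """m/z of the y_n fragment (C-terminal n residues), a common product ion."""
    mass = sum(RESIDUE[aa] for aa in peptide[-n:]) + WATER
    return (mass + charge * PROTON) / charge
```

For the hypothetical target peptide PEPTIDEK, `precursor_mz("PEPTIDEK", 2)` gives the doubly charged precursor (about 464.73) and `y_ion_mz("PEPTIDEK", 4)` one candidate product ion; a transition entry pairs two such values with instrument settings such as collision energy.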
One purpose of the biomedical literature is to report results in sufficient detail so that the methods of data collection and analysis can be independently replicated and verified. Here we present for consideration a minimum information specification for gene expression localization experiments, called the “Minimum Information Specification For In Situ Hybridization and Immunohistochemistry Experiments (MISFISHIE)”. It is modelled after the MIAME (Minimum Information About a Microarray Experiment) specification for microarray experiments. Data specifications like MIAME and MISFISHIE specify the information content without dictating a format for encoding that information. The MISFISHIE specification describes six types of information that should be provided for each experiment: Experimental Design, Biomaterials and Treatments, Reporters, Staining, Imaging Data, and Image Characterizations. This specification has benefited the consortium within which it was initially developed and is expected to benefit the wider research community. We welcome feedback from the scientific community to help improve our proposal.
Data standardization; Human Proteome Organisation; Proteomics Standards Initiative
Mass spectrometry data have long offered the potential of discovering biomarkers that would enable clinicians to diagnose disease and treat it with targeted therapies. Hundreds of human samples have been used to generate thousands of spectra for identification. These data, and the generation of targeted peptide information, represent the first step in the process of locating disease biomarkers. Reaching the goal of clinical proteomics requires that these data be integrated with additional information from the disease literature and from genomic studies. Here we describe PeptideAtlas and associated methods for mining the data, as well as the software tools necessary to support large-scale integration and mining.
SRM; Mass spectrometry; proteomic; visualization; data mining
Progress in mass spectrometry-based methods for veterinary research and diagnostics lags behind that in human research, and proteome data for domestic animals are still not well represented in open-source data repositories. This is particularly true for the equine species. Here we present the first Equine PeptideAtlas, encompassing high-resolution tandem mass spectrometry analyses of 51 samples representing a selection of equine tissues and body fluids from healthy and diseased animals. The raw data were processed through the Trans-Proteomic Pipeline to yield high-quality identifications of proteins and peptides. The current release comprises 24,131 distinct peptides representing 2636 canonical proteins, observed at false discovery rates of 0.2% at the peptide level and 1.4% at the protein level. Data from the Equine PeptideAtlas are available for experimental planning, for validation of new datasets, and as a proteomic data-mining resource. The advantages of the Equine PeptideAtlas are demonstrated by examples of mining its contents for information on potential and well-known equine acute phase proteins, which are of broad interest in the veterinary clinic. The extracted information will support further analyses and emphasises the value of the Equine PeptideAtlas as a resource for the design of targeted quantitative proteomic studies.
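The false discovery rates quoted above are estimated by the Trans-Proteomic Pipeline with statistical models; as a simplified, hedged illustration of the underlying target-decoy idea (not the actual TPP implementation), an FDR at a given score threshold can be estimated by counting decoy hits among the accepted identifications:

```python
def decoy_fdr(hits, threshold):
    """Estimate FDR as decoys/targets among hits scoring at or above the
    threshold, assuming decoy hits model the rate of incorrect target hits.

    hits: iterable of (score, is_decoy) tuples.
    """
    targets = sum(1 for score, is_decoy in hits if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in hits if score >= threshold and is_decoy)
    return decoys / targets if targets else 1.0
```

In practice the threshold is chosen as the most permissive score at which the estimated FDR stays at or below the desired level (e.g. 1% at the protein level).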
Acute phase proteins; Animal proteomics; Equine; PeptideAtlas; Proteotypic peptides