Metaproteomics enables the investigation of the protein repertoire expressed by complex microbial communities. However, to unleash its full potential, refinements in bioinformatic approaches for data analysis are still needed. In this context, sequence database selection represents a major challenge.
This work assessed the impact of different databases on metaproteomic investigations by using a mock microbial mixture including nine diverse bacterial and eukaryotic species, which was subjected to shotgun metaproteomic analysis. Then, both the microbial mixture and the single microorganisms were subjected to next-generation sequencing to obtain experimental metagenomic- and genomic-derived databases, which were used along with public databases (namely, NCBI, UniProtKB/Swiss-Prot and UniProtKB/TrEMBL, parsed at different taxonomic levels) to analyze the metaproteomic dataset. First, a quantitative comparison in terms of number and overlap of peptide identifications was carried out among all databases. As a result, only 35% of peptides were common to all database classes; moreover, genus/species-specific databases provided up to 17% more identifications compared to databases with generic taxonomy, while the metagenomic database enabled a slight increase with respect to public databases. Then, database behavior in terms of false discovery rate and peptide degeneracy was critically evaluated. Public databases with generic taxonomy exhibited a markedly different trend compared to their counterparts. Finally, the reliability of taxonomic attribution according to the lowest common ancestor approach (using MEGAN and Unipept software) was assessed. The level of misassignment varied among the different databases, and specific thresholds based on the number of taxon-specific peptides were established to minimize false positives. This study confirms that database selection has a significant impact on metaproteomics, and provides critical indications for improving the depth and reliability of metaproteomic results. Specifically, the use of iterative searches and of suitable filters for taxonomic assignments is proposed with the aim of increasing the coverage and trustworthiness of metaproteomic data.
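The lowest common ancestor idea combined with a threshold on taxon-specific peptides can be sketched as follows. This is a minimal illustration only: the lineages, peptide sequences and threshold value are hypothetical, and tools such as MEGAN and Unipept implement considerably more elaborate versions of this logic.

```python
# Minimal sketch of lowest-common-ancestor (LCA) taxonomic assignment with a
# threshold on taxon-specific peptides. Illustrative only; lineages and the
# threshold value are made up, not taken from the study.

def lca(lineages):
    """Return the deepest taxon shared by all lineages (root-to-leaf lists)."""
    common = []
    for level in zip(*lineages):
        if len(set(level)) == 1:
            common.append(level[0])
        else:
            break
    return common[-1] if common else None

def assign_taxon(peptide_hits, min_specific_peptides=2):
    """Accept a taxon call only if enough peptides resolve specifically to it."""
    calls = [lca(lin) for lin in peptide_hits.values()]
    counts = {}
    for c in calls:
        counts[c] = counts.get(c, 0) + 1
    return {t: n for t, n in counts.items() if n >= min_specific_peptides}

hits = {
    "PEPTIDEA": [["Bacteria", "Firmicutes", "Lactobacillus"]],
    "PEPTIDEB": [["Bacteria", "Firmicutes", "Lactobacillus"]],
    "PEPTIDEC": [["Bacteria", "Firmicutes", "Lactobacillus"],
                 ["Bacteria", "Proteobacteria", "Escherichia"]],  # degenerate peptide
}
print(assign_taxon(hits))  # the degenerate peptide resolves only to "Bacteria"
```

The degenerate peptide is pushed up to a higher rank by the LCA, and the threshold then discards taxa supported by too few specific peptides, which is the spirit of the misassignment filters described above.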
The molecular chaperone Hsp90-dependent proteome represents a complex protein network of critical biological and medical relevance. Known to associate with proteins with a broad variety of functions termed clients, Hsp90 maintains key essential and oncogenic signalling pathways. Consequently, Hsp90 inhibitors are being tested as anti-cancer drugs. Using an integrated systematic approach to analyse the effects of Hsp90 inhibition in T-cells, we quantified differential changes in the Hsp90-dependent proteome, Hsp90 interactome, and a selection of the transcriptome. Kinetic behaviours in the Hsp90-dependent proteome were assessed using a novel pulse-chase strategy (Fierro-Monti et al., accompanying article), detecting effects on both protein stability and synthesis. Global and specific dynamic impacts, including proteostatic responses, are due to direct inhibition of Hsp90 as well as indirect effects. As a result, a decrease was detected in most proteins that changed their levels, including known Hsp90 clients. Most likely, consequences of the role of Hsp90 in gene expression determined a global reduction in net de novo protein synthesis. This decrease appeared to be greater in magnitude than a concomitantly observed global increase in protein decay rates. Several novel putative Hsp90 clients were validated, and interestingly, protein families with critical functions, particularly the Hsp90 family and cofactors themselves as well as protein kinases, displayed strongly increased decay rates due to Hsp90 inhibitor treatment. Remarkably, an upsurge in survival pathways, involving molecular chaperones and several oncoproteins, and decreased levels of some tumour suppressors, have implications for anti-cancer therapy with Hsp90 inhibitors. The diversity of global effects may represent a paradigm of mechanisms that are operating to shield cells from proteotoxic stress, by promoting pro-survival and anti-proliferative functions. 
Data are available via ProteomeXchange with identifier PXD000537.
Standard proteomics methods allow the relative quantitation of the levels of thousands of proteins in two or more samples. While such methods are invaluable for defining the variations in protein concentrations which follow the perturbation of a biological system, they do not offer information on the mechanisms underlying such changes. Expanding on previous work, we developed a pulse-chase (pc) variant of SILAC (stable isotope labeling by amino acids in cell culture). pcSILAC can quantitate, in a single experiment and for two conditions, the relative levels of proteins newly synthesized within a given time window as well as the relative levels of remaining preexisting proteins. We validated the method by studying the drug-mediated inhibition of the Hsp90 molecular chaperone, which is known to lead to increased synthesis of stress response proteins as well as increased decay of Hsp90 “clients”. We showed that pcSILAC can give information on changes in global cellular proteostasis induced by treatment with the inhibitor, which are normally not captured by standard relative quantitation techniques. Furthermore, we have developed a mathematical model and computational framework that uses pcSILAC data to determine degradation constants kd and synthesis rates Vs for proteins in both control and drug-treated cells. The results show that Hsp90 inhibition induced a generalized slowdown of protein synthesis and an increase in protein decay. Treatment with the inhibitor also resulted in widespread protein-specific changes in relative synthesis rates, together with variations in protein decay rates. The latter were more restricted to individual proteins or protein families than the variations in synthesis. Our results establish pcSILAC as a viable workflow for the mechanistic dissection of proteome changes that follow perturbations. Data are available via ProteomeXchange with identifier PXD000538.
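The simplest first-order kinetic model consistent with this description is dP/dt = Vs − kd·P: preexisting (pulse-labeled) protein decays as P_old(t) = P0·e^(−kd·t), while newly synthesized protein accumulates as P_new(t) = (Vs/kd)·(1 − e^(−kd·t)). The sketch below inverts these two relations; it is an illustration of the modeling idea under these assumptions, not the published fitting framework, and all numbers are invented.

```python
# Sketch of a first-order pulse-chase model: preexisting protein decays as
# P_old(t) = P0 * exp(-kd * t); newly synthesized protein accumulates as
# P_new(t) = (Vs / kd) * (1 - exp(-kd * t)). Values are made up.

import math

def estimate_kd(p0, p_t, t):
    """Degradation constant from the decay of the preexisting-protein signal."""
    return -math.log(p_t / p0) / t

def estimate_vs(p_new, kd, t):
    """Synthesis rate from the accumulation of the newly synthesized signal."""
    return p_new * kd / (1.0 - math.exp(-kd * t))

kd = estimate_kd(p0=100.0, p_t=50.0, t=6.0)   # half-life of 6 h -> kd = ln(2)/6
vs = estimate_vs(p_new=30.0, kd=kd, t=6.0)
print(round(kd, 4), round(vs, 2))
```

Comparing kd and Vs between control and drug-treated cells then separates decay effects from synthesis effects, which is exactly the distinction a single relative-abundance ratio cannot make.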
Global lipidomics analysis across large sample sizes produces high-content datasets that require dedicated software tools supporting lipid identification and quantification, efficient data management and lipidome visualization. Here we present a novel software-based platform for streamlined data processing, management and visualization of shotgun lipidomics data acquired using high-resolution Orbitrap mass spectrometry. The platform features the ALEX framework designed for automated identification and export of lipid species intensity directly from proprietary mass spectral data files, and an auxiliary workflow using database exploration tools for integration of sample information, computation of lipid abundance and lipidome visualization. A key feature of the platform is the organization of lipidomics data in “database table format” which provides the user with an unsurpassed flexibility for rapid lipidome navigation using selected features within the dataset. To demonstrate the efficacy of the platform, we present a comparative neurolipidomics study of cerebellum, hippocampus and somatosensory barrel cortex (S1BF) from wild-type and knockout mice devoid of the putative lipid phosphate phosphatase PRG-1 (plasticity related gene-1). The presented framework is generic, extendable to processing and integration of other lipidomic data structures, can be interfaced with post-processing protocols supporting statistical testing and multivariate analysis, and can serve as an avenue for disseminating lipidomics data within the scientific community. The ALEX software is available at www.msLipidomics.info.
Biological applications, from genomics to ecology, deal with graphs that represent the structure of interactions. Analyzing such data requires searching for subgraphs in collections of graphs. This task is computationally expensive. Even though multicore architectures, from commodity computers to more advanced symmetric multiprocessing (SMP) systems, offer scalable computing power, currently published software implementations for indexing and graph matching are fundamentally sequential. As a consequence, such software implementations (i) do not fully exploit available parallel computing power and (ii) do not scale with respect to the size of graphs in the database. We present GRAPES, software for parallel searching on databases of large biological graphs. GRAPES implements a parallel version of well-established graph searching algorithms, and introduces new strategies which naturally lead to a faster parallel searching system, especially for large graphs. GRAPES decomposes graphs into subcomponents that can be efficiently searched in parallel. We show the performance of GRAPES on representative biological datasets containing antiviral chemical compounds, DNA, RNA, proteins, protein contact maps and protein interaction networks.
A constant influx of new data poses a challenge in keeping the annotation in biological databases current. Most biological databases contain significant quantities of textual annotation, which often represents the richest source of knowledge. Many databases reuse existing knowledge; during the curation process, annotations are often propagated between entries. However, this is often not made explicit. Therefore, it can be hard, potentially impossible, for a reader to identify where an annotation originated. Within this work we attempt to identify annotation provenance and track its subsequent propagation. Specifically, we exploit annotation reuse within the UniProt Knowledgebase (UniProtKB), at the level of individual sentences. We describe a visualisation approach for the provenance and propagation of sentences in UniProtKB which enables a large-scale statistical analysis. Initially, levels of sentence reuse within UniProtKB were analysed, showing that reuse is heavily prevalent, which enables the tracking of provenance and propagation. By analysing sentences throughout UniProtKB, a number of interesting propagation patterns were identified, covering over sentences. Over sentences remain in the database after they have been removed from the entries where they originally occurred. Analysing a subset of these sentences suggests that approximately are erroneous, whilst appear to be inconsistent. These results suggest that being able to visualise sentence propagation and provenance can aid in the determination of the accuracy and quality of textual annotation.
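The core bookkeeping behind sentence-level provenance can be sketched as follows: record the first release and entry in which each sentence appears, then flag sentences that persist elsewhere after leaving their entry of origin. The entry names, sentences and two-release history below are entirely invented; the actual analysis over UniProtKB releases is far larger and more nuanced.

```python
# Sketch of sentence-level reuse and provenance tracking across database
# releases. Entries, sentences and releases are hypothetical examples.

def provenance(releases):
    """releases: chronological list of {entry: [sentences]} snapshots.
    Returns sentence -> (release index, entry) of first appearance."""
    origin = {}
    for i, snapshot in enumerate(releases):
        for entry, sentences in snapshot.items():
            for s in sentences:
                origin.setdefault(s, (i, entry))
    return origin

def orphaned(releases):
    """Sentences still present somewhere, but gone from their entry of origin."""
    origin = provenance(releases)
    last = releases[-1]
    out = set()
    for s, (_, entry) in origin.items():
        if s not in last.get(entry, []) and any(s in v for v in last.values()):
            out.add(s)
    return out

r0 = {"P1": ["Binds ATP."], "P2": ["Membrane protein."]}
r1 = {"P1": ["Binds GTP."], "P2": ["Membrane protein.", "Binds ATP."]}
print(orphaned([r0, r1]))
```

Sentences flagged this way are precisely the interesting cases discussed above: annotation that outlives its source entry and may therefore be propagating an error.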
Source code and supplementary data are available from the authors' website at http://homepages.cs.ncl.ac.uk/m.j.bell1/sentence_analysis/.
Analyzing time-course expression data captured in microarray datasets is a complex undertaking as the vast and complex data space is represented by a relatively low number of samples as compared to thousands of available genes. Here, we developed the Interdependent Correlation Clustering (ICC) method to analyze relationships that exist among genes conditioned on the expression of a specific target gene in microarray data. Based on Correlation Clustering, the ICC method analyzes a large set of correlation values related to gene expression profiles extracted from given microarray datasets. ICC can be applied to any microarray dataset and any target gene. We applied this method to microarray data generated from wine fermentations and selected NSF1, which encodes a C2H2 zinc finger-type transcription factor, as the target gene. The validity of the method was verified by accurate identifications of the previously known functional roles of NSF1. In addition, we identified and verified potential new functions for this gene; specifically, NSF1 is a negative regulator for the expression of sulfur metabolism genes, the nuclear localization of Nsf1 protein (Nsf1p) is controlled in a sulfur-dependent manner, and the transcription of NSF1 is regulated by Met4p, an important transcriptional activator of sulfur metabolism genes. The inter-disciplinary approach adopted here highlighted the accuracy and relevancy of the ICC method in mining for novel gene functions using complex microarray datasets with a limited number of samples.
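The conditioning idea at the heart of such an analysis can be illustrated with a first-order partial correlation: the correlation between two genes x and y after removing the linear effect of a target gene z. This is only a sketch of the conditioning concept, not the published ICC algorithm, and the expression vectors are invented.

```python
# Illustration of conditioning gene-gene correlation on a target gene via
# first-order partial correlation. Example vectors are made up.

import math

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)

def partial_corr(x, y, z):
    """Correlation of x and y with the linear effect of z removed."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

# Both x and y track z, but their residuals around z are perfectly opposed:
x, y, z = [2, 1, 2, 5], [0, 3, 4, 3], [1, 2, 3, 4]
print(round(partial_corr(x, y, z), 4))  # -> -1.0
```

Here the raw correlation between x and y is weakly positive because both follow z, yet conditioning on z reveals a perfect negative relationship, the kind of target-dependent structure such methods aim to expose.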
In liquid chromatography-mass spectrometry (LC-MS), parts of LC peaks are often corrupted by their co-eluting peptides, which results in increased quantification variance. In this paper, we propose to apply accurate LC peak boundary detection to remove the corrupted part of LC peaks. Accurate LC peak boundary detection is achieved by checking the consistency of intensity patterns within peptide elution time ranges. In addition, we remove peptides with erroneous mass assignment through model fitness check, which compares observed intensity patterns to theoretically constructed ones. The proposed algorithm can significantly improve the accuracy and precision of peptide ratio measurements.
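The consistency check can be sketched as follows: starting from the apex scan of an elution window, trim away flanking scans whose isotope-intensity pattern correlates poorly with the apex pattern, since a co-eluting peptide distorts that pattern. The scan data and correlation threshold are invented for illustration; this is not the published algorithm.

```python
# Sketch of LC peak boundary trimming by intensity-pattern consistency.
# Scans whose isotope-intensity pattern deviates from the apex scan's pattern
# (e.g. due to a co-eluting peptide) are cut. Data and threshold are made up.

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a) ** 0.5
    vb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (va * vb) if va and vb else 0.0

def trim_peak(scans, min_r=0.95):
    """scans: list of isotope-intensity vectors across the elution window.
    Returns (left, right) indices of the consistent core of the peak."""
    apex = max(range(len(scans)), key=lambda i: sum(scans[i]))
    left = apex
    while left > 0 and pearson(scans[left - 1], scans[apex]) >= min_r:
        left -= 1
    right = apex
    while right < len(scans) - 1 and pearson(scans[right + 1], scans[apex]) >= min_r:
        right += 1
    return left, right

scans = [
    [5, 1, 9],       # corrupted by a co-eluting species
    [10, 7, 3],
    [100, 70, 30],   # apex
    [50, 35, 15],
    [2, 8, 1],       # corrupted
]
print(trim_peak(scans))  # -> (1, 3)
```

Only the scans whose pattern is a scaled copy of the apex survive; integrating intensity over that trimmed range is what reduces the quantification variance discussed above.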
Several approaches exist for the quantification of proteins in complex samples processed by liquid chromatography-mass spectrometry followed by fragmentation analysis (MS2). One of these approaches is label-free MS2-based quantification, which takes advantage of the information computed from MS2 spectrum observations to estimate the abundance of a protein in a sample. As a first step in this approach, fragmentation spectra are typically matched to the peptides that generated them by a search algorithm. Because different search algorithms identify overlapping but non-identical sets of peptides, here we investigate whether these differences in peptide identification have an impact on the quantification of the proteins in the sample. We therefore evaluated the effect of using different search algorithms by examining the reproducibility of protein quantification in technical repeat measurements of the same sample. From our results, it is clear that a search engine effect does exist for MS2-based label-free protein quantification methods. As a general conclusion, it is recommended to address the overall possibility of search engine-induced bias in the protein quantification results of label-free MS2-based methods by performing the analysis with two or more distinct search engines.
We present jClustering, an open framework for the design of clustering algorithms in dynamic medical imaging. We developed this tool because of the difficulty involved in manually segmenting dynamic PET images and the lack of availability of source code for published segmentation algorithms. Providing an easily extensible open tool encourages publication of source code to facilitate the process of comparing algorithms and provide interested third parties with the opportunity to review code. The internal structure of the framework allows an external developer to implement new algorithms easily and quickly, focusing only on the particulars of the method being implemented and not on image data handling and preprocessing. This tool has been coded in Java and is presented as an ImageJ plugin in order to take advantage of all the functionalities offered by this imaging analysis platform. Both binary packages and source code have been published, the latter under a free software license (GNU General Public License) to allow modification if necessary.
Human protein kinases play fundamental roles mediating the majority of signal transduction pathways in eukaryotic cells as well as a multitude of other processes involved in metabolism, cell-cycle regulation, cellular shape, motility, differentiation and apoptosis. The human protein kinome contains 518 members. Most studies that focus on the human kinome require, at some point, the visualization of large amounts of data. The visualization of such data within the framework of a phylogenetic tree may help identify key relationships between different protein kinases in view of their evolutionary distance and the information used to annotate the kinome tree. For example, studies that focus on the promiscuity of kinase inhibitors can benefit from the annotations to depict binding affinities across kinase groups. Images involving the mapping of information into the kinome tree are common. However, producing such figures manually can be a long arduous process prone to errors. To circumvent this issue, we have developed a web-based tool called Kinome Render (KR) that produces customized annotations on the human kinome tree. KR allows the creation and automatic overlay of customizable text or shape-based annotations of different sizes and colors on the human kinome tree. The web interface can be accessed at: http://bcb.med.usherbrooke.ca/kinomerender. A stand-alone version is also available and can be run locally.
Annotation; Human kinome tree; Protein kinases; Data visualisation
Summary: Automated image processing has allowed cell migration research to evolve into a high-throughput research field. As a consequence, there is now an unmet need for data management in this domain. The absence of a generic management system for the quantitative data generated in cell migration assays results in each dataset being treated in isolation, making data comparison across experiments difficult. Moreover, by integrating quality control and analysis capabilities into such a data management system, the common practice of manually transferring data across different downstream analysis tools can be markedly sped up and made more robust. In addition, access to a data management solution creates gateways for data standardization, meta-analysis and structured public data dissemination.
We here present CellMissy, a cross-platform data management system for cell migration data with a focus on wound healing data. CellMissy simplifies and automates data management, storage and analysis from the initial experimental set-up to data exploration.
Availability and implementation: CellMissy is a cross-platform open-source software developed in Java. Source code and cross-platform binaries are freely available under the Apache2 open source license at http://cellmissy.googlecode.com.
Supplementary data are available at Bioinformatics online.
Identifying peptides from fragmentation spectra is a fundamental step in mass spectrometry (MS) data processing. The significance (discriminability) of every peak varies, providing additional information that can potentially enhance identification sensitivity and the correct match rate. However, this important information was not considered in previous algorithms. Here we present a novel method based on Peptide Matching Discriminability (PMD), in which the PMD information of every peak reflects the discriminability of candidate peptides. In addition, we developed a novel peptide scoring algorithm, Dispec, based on PMD, taking three aspects of discriminability into consideration: PMD, intensity discriminability and m/z error discriminability. Compared with Mascot and Sequest, Dispec identified remarkably more peptides from three experimental datasets at the same confidence of 1% PSM-level FDR. Dispec is also robust and versatile for various datasets obtained on different instruments. The concept of discriminability enhances peptide identification and may thus contribute substantially to proteome studies. As an open-source program, Dispec is freely available at http://bioinformatics.jnu.edu.cn/software/dispec/.
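A peak-weighted match score in this spirit can be sketched as below: each matched peak contributes according to how few candidate peptides it matches (the discriminability idea), its intensity rank, and its m/z error. The functional forms, weights and tolerance are hypothetical, not the published Dispec scoring function.

```python
# Hedged sketch of a discriminability-weighted peak score. The weighting
# scheme and tolerance below are invented for illustration.

def peak_weight(n_candidates_matching, intensity_rank, mz_error, tol=0.02):
    pmd = 1.0 / n_candidates_matching          # rarer matches discriminate more
    inten = 1.0 / (1.0 + intensity_rank)       # higher-intensity peaks count more
    err = max(0.0, 1.0 - abs(mz_error) / tol)  # smaller m/z error counts more
    return pmd * inten * err

def score(matched_peaks):
    """matched_peaks: list of (n_candidates_matching, intensity_rank, mz_error)."""
    return sum(peak_weight(*p) for p in matched_peaks)

s = score([(1, 0, 0.001), (4, 2, 0.010), (2, 5, 0.019)])
print(round(s, 4))
```

The effect is that a single intense, accurate peak matched by only one candidate peptide outweighs several ambiguous, noisy peaks, which is the intuition behind weighting peaks by their discriminability rather than treating them uniformly.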
A/J and 129P3/J mouse strains have different susceptibilities to dental fluorosis due to their genetic backgrounds. They also differ with respect to several features of fluoride (F) metabolism and metabolic handling of water. This study was done to determine whether differences in F metabolism could be explained by diversities in the profile of protein expression in kidneys. Weanling, male A/J mice (susceptible to dental fluorosis, n = 18) and 129P3/J mice (resistant, n = 18) were housed in pairs and assigned to three groups given low-F food and drinking water containing 0, 10 or 50 ppm [F] for 7 weeks. Renal proteome profiles were examined using 2D-PAGE and LC-MS/MS. Quantitative intensity analysis detected 122, 126 and 134 spots differentially expressed between the A/J and 129P3/J strains in the groups receiving 0, 10 and 50 ppm F, respectively. Of these, 25, 30 and 32, respectively, were successfully identified. Most of the proteins were related to metabolic and cellular processes, followed by response to stimuli, development and regulation of cellular processes. In F-treated groups, PDZK-1, a protein involved in the regulation of renal tubular reabsorption capacity, was down-modulated in the kidney of 129P3/J mice. A/J and 129P3/J mice exhibited 11 and 3 exclusive proteins, respectively, regardless of F exposure. In conclusion, proteomic analysis was able to identify proteins potentially involved in metabolic handling of F and water that are differentially expressed or even not expressed in the strains evaluated. This can contribute to understanding the molecular mechanisms underlying genetic susceptibility to dental fluorosis, by indicating key proteins that should be better addressed in future studies.
The statistical validation of database search results is a complex issue in bottom-up proteomics. The correct and incorrect peptide spectrum match (PSM) scores overlap significantly, making an accurate assessment of true peptide matches challenging. Since complete separation between the true and false hits is practically never achieved, there is a need for better methods and rescoring algorithms to improve upon the primary database search results. Here we describe the calibration and False Discovery Rate (FDR) estimation of database search scores through a dynamic FDR calculation method, FlexiFDR, which increases both the sensitivity and specificity of search results. By fitting a simple linear regression to the decoy hits for different charge states, the method maximized the number of true positives and reduced the number of false negatives in several standard datasets of varying complexity (18-mix, 49-mix, 200-mix) and a few complex datasets (E. coli and Yeast) obtained from a wide variety of MS platforms. The net positive gain for correct spectral and peptide identifications was up to 14.81% and 6.2%, respectively. The approach is applicable to different search methodologies: separate as well as concatenated database search, high mass accuracy, and semi-tryptic and modification searches. FlexiFDR was also applied to Mascot results and improved upon the original search output. We have shown that an appropriate threshold learnt from decoys can be very effective in improving the database search results. FlexiFDR adapts itself to different instruments, data types and MS platforms. It learns from the decoy hits and sets a flexible threshold that automatically aligns itself to the underlying variables of data quality and size.
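The basic target-decoy mechanics behind any such method can be sketched as follows: for each charge state, find the lowest score threshold at which the ratio of decoy to target hits stays below the desired FDR. The per-charge split loosely mimics charge-aware calibration; the PSM data, the simple counting estimator and the threshold search are invented for illustration and are not FlexiFDR's regression-based procedure.

```python
# Minimal target-decoy FDR sketch with a separate threshold per charge state.
# The estimator (#decoys / #targets) and the example data are illustrative.

def threshold_at_fdr(psms, max_fdr=0.01):
    """psms: list of (score, is_decoy); higher score = better match.
    Returns the lowest score threshold keeping the decoy/target ratio <= max_fdr."""
    psms = sorted(psms, key=lambda p: -p[0])
    targets = decoys = 0
    best = None
    for score, is_decoy in psms:
        decoys += is_decoy
        targets += not is_decoy
        if targets and decoys / targets <= max_fdr:
            best = score
    return best

def per_charge_thresholds(psms_by_charge, max_fdr=0.01):
    return {z: threshold_at_fdr(p, max_fdr) for z, p in psms_by_charge.items()}

psms = [(50, False), (45, False), (40, False), (30, True), (20, False)]
print(threshold_at_fdr(psms, max_fdr=0.1))  # -> 40
```

A single global threshold forces every charge state onto the same cutoff even though their score distributions differ; computing the cutoff per charge is the simplest form of the charge-aware flexibility described above.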
Drug-induced liver injury (DILI) is the leading cause of acute liver failure. Currently, no adequate predictive biomarkers for DILI are available. This study describes a translational approach using proteomic profiling for the identification of urinary proteins related to acute liver injury induced by acetaminophen (APAP). Mice were given a single intraperitoneal dose of APAP (0–350 mg/kg bw) followed by 24 h urine collection. Doses of ≥275 mg/kg bw APAP resulted in hepatic centrilobular necrosis and significantly elevated plasma alanine aminotransferase (ALT) values (p<0.0001). Proteomic profiling resulted in the identification of 12 differentially excreted proteins in urine of mice with acute liver injury (p<0.001), including superoxide dismutase 1 (SOD1), carbonic anhydrase 3 (CA3) and calmodulin (CaM), as novel biomarkers for APAP-induced liver injury. Urinary levels of SOD1 and CA3 increased with rising plasma ALT levels, but urinary CaM was already present in mice treated with a high dose of APAP without elevated plasma ALT levels. Importantly, we showed the presence of SOD1 and CA3 in human urine after APAP intoxication, whereas both proteins were absent in control urine samples. Urinary concentrations of CaM were significantly increased and correlated well with plasma APAP concentrations (r = 0.97; p<0.0001) in patients with APAP intoxication who did not present with elevated plasma ALT levels. In conclusion, using this urinary proteomics approach we demonstrate CA3, SOD1 and, most importantly, CaM as potential human biomarkers for APAP-induced liver injury.
We here present The Online Protein Processing Resource (TOPPR; http://iomics.ugent.be/toppr/), an online database that contains thousands of published proteolytically processed sites in human and mouse proteins. These cleavage events were identified with COmbined FRActional DIagonal Chromatography (COFRADIC) proteomics technologies, and the resulting database is provided with full data provenance. Indeed, TOPPR provides an interactive visual display of the actual fragmentation mass spectrum that led to each identification of a reported processed site, complete with fragment ion annotations and search engine scores. Apart from warehousing and disseminating these data in an intuitive manner, TOPPR also provides an online analysis platform, including methods to analyze protease specificity and substrate-centric analyses. Concretely, TOPPR supports three ways to retrieve data: (i) the retrieval of all substrates for one or more cellular stimuli or assays; (ii) a substrate search by UniProtKB/Swiss-Prot accession number, entry name or description; and (iii) a motif search that retrieves substrates matching a user-defined protease specificity profile. The analysis of the substrates is supported through the presence of a variety of annotations, including predicted secondary structure, known domains and experimentally obtained 3D structure where available. Across substrates, substrate orthologs and conserved sequence stretches can also be shown, with iceLogo visualization provided for the latter.
Here, we present LNCipedia (http://www.lncipedia.org), a novel database for human long non-coding RNA (lncRNA) transcripts and genes. LncRNAs constitute a large and diverse class of non-coding RNA genes. Although several lncRNAs have been functionally annotated, the majority remains to be characterized. Different high-throughput methods to identify new lncRNAs (including RNA sequencing and annotation of chromatin-state maps) have been applied in various studies resulting in multiple unrelated lncRNA data sets. LNCipedia offers 21 488 annotated human lncRNA transcripts obtained from different sources. In addition to basic transcript information and gene structure, several statistics are determined for each entry in the database, such as secondary structure information, protein coding potential and microRNA binding sites. Our analyses suggest that, much like microRNAs, many lncRNAs have a significant secondary structure, in line with their presumed association with proteins or protein complexes. Available literature on specific lncRNAs is linked, and users or authors can submit articles through a web interface. Protein coding potential is assessed by two different prediction algorithms: Coding Potential Calculator and HMMER. In addition, a novel strategy has been integrated for detecting potentially coding lncRNAs by automatically re-analysing the large body of publicly available mass spectrometry data in the PRIDE database. LNCipedia is publicly available and allows users to query and download lncRNA sequences and structures based on different search criteria. The database may serve as a resource to initiate small- and large-scale lncRNA studies. As an example, the LNCipedia content was used to develop a custom microarray for expression profiling of all available lncRNAs.
Patchy landscapes driven by human decisions and/or natural forces remain a challenge to understand and model. To date, no attempt has been made to describe them within a coherent framework and to formalize landscape-changing rules. Overcoming this lacuna was our first objective here, and our approach was largely based on the notion of Rewriting Systems, also called Formal Grammars. We used complex scenarios of agricultural dynamics to model landscapes and to write their corresponding driving-rule equations. Our second objective was to illustrate the relevance of this landscape-language concept for landscape modelling through various grassland managements, with the final aim of assessing their respective impacts on biological conservation. For this purpose, we made the assumptions that a higher grassland appearance frequency and higher land-cover connectivity are favourable to species conservation. Ecological results revealed that dairy and beef livestock production systems are more favourable to wild species than is hog farming, although in different ways. Methodological results allowed us to efficiently model and formalize these landscape dynamics. This study demonstrates the applicability of the Rewriting System framework to the modelling of agricultural landscapes and, hopefully, to other patchy landscapes. The newly defined grammar is able to express changes that are neither necessarily local nor Markovian, and opens a way to analytical modelling of landscape dynamics.
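A toy rewriting system over a patchy landscape can be sketched as below: the landscape is a sequence of patch covers, and a rule set maps a cover, optionally conditioned on its neighbours, to a new cover at each time step. The cover names and rules are invented, and this simple local, one-step formulation is deliberately more restrictive than the grammar described above, which also allows non-local and non-Markovian rules.

```python
# Toy landscape rewriting system. Covers ("grass", "crop", "fallow") and the
# rules below are hypothetical examples, not the grammar from the study.

def step(landscape, rules):
    """Apply rules in parallel; a patch keeps its cover if no rule matches.
    rules: list of (lhs_cover, required_neighbour_set_or_None, rhs_cover)."""
    out = []
    for i, cover in enumerate(landscape):
        neighbours = {landscape[j] for j in (i - 1, i + 1) if 0 <= j < len(landscape)}
        for lhs, cond, rhs in rules:
            if cover == lhs and (cond is None or cond & neighbours):
                out.append(rhs)
                break
        else:
            out.append(cover)
    return out

# Rule 1: a crop patch reverts to grassland when adjacent to grassland.
# Rule 2: a fallow patch always becomes crop.
rules = [("crop", {"grass"}, "grass"), ("fallow", None, "crop")]
land = ["grass", "crop", "fallow", "crop"]
print(step(land, rules))  # -> ['grass', 'grass', 'crop', 'crop']
```

Iterating `step` then yields a landscape trajectory, and metrics such as grassland frequency or cover connectivity can be computed on each state to assess conservation value, mirroring the assessment strategy described above.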
The original PRIDE Converter tool greatly simplified the process of submitting mass spectrometry (MS)-based proteomics data to the PRIDE database. However, after much user feedback, it was noted that the tool had some limitations and could not handle several user requirements that were now becoming commonplace. This prompted us to design and implement a whole new suite of tools that would build on the successes of the original PRIDE Converter and allow users to generate submission-ready, well-annotated PRIDE XML files. The PRIDE Converter 2 tool suite allows users to convert search result files into PRIDE XML (the format needed for performing submissions to the PRIDE database), generate mzTab skeleton files that can be used as a basis to submit quantitative and gel-based MS data, and post-process PRIDE XML files by filtering out contaminants and empty spectra, or by merging several PRIDE XML files together. All the tools have both a graphical user interface that provides a dialog-based, user-friendly way to convert and prepare files for submission, as well as a command-line interface that can be used to integrate the tools into existing or novel pipelines, for batch processing and power users. The PRIDE Converter 2 tool suite will thus become a cornerstone in the submission process to PRIDE and, by extension, to the ProteomeXchange consortium of MS-proteomics data repositories.
Targeted proteomics via selected reaction monitoring is a powerful mass spectrometric technique affording higher dynamic range, increased specificity and lower limits of detection than other shotgun mass spectrometry methods when applied to proteome analyses. However, it involves selective measurement of predetermined analytes, which requires more preparation in the form of selecting appropriate signatures for the proteins and peptides that are to be targeted. There is a growing number of software programs and resources for selecting optimal transitions and the instrument settings used for the detection and quantification of the targeted peptides, but the exchange of this information is hindered by a lack of a standard format. We have developed a new standardized format, called TraML, for encoding transition lists and associated metadata. In addition to introducing the TraML format, we demonstrate several implementations across the community, and provide semantic validators, extensive documentation, and multiple example instances to demonstrate correctly written documents. Widespread use of TraML will facilitate the exchange of transitions, reduce time spent handling incompatible list formats, increase the reusability of previously optimized transitions, and thus accelerate the widespread adoption of targeted proteomics via selected reaction monitoring.
The growing interest in the field of proteomics has increased the demand for software tools and applications that process and analyze the resulting data. Even though the purpose of these tools can vary significantly, they usually share a basic set of features, including the handling of protein and peptide sequences, the visualization of (and interaction with) spectra and chromatograms, and the parsing of results from various proteomics search engines. Developers typically spend considerable time and effort implementing these support structures, which detracts from working on the novel aspects of their tool.
In order to simplify the development of proteomics tools, we have implemented an open-source support library for computational proteomics, called compomics-utilities. The library contains a broad set of features required for reading, parsing, and analyzing proteomics data. compomics-utilities is already used by a long list of existing software, ensuring library stability and continued support and development.
As a user-friendly, well-documented and open-source library, compomics-utilities greatly simplifies the implementation of the basic features needed in most proteomics tools. Implemented in 100% Java, compomics-utilities is fully portable across platforms and architectures. Our library thus allows the developers to focus on the novel aspects of their tools, rather than on the basic functions, which can contribute substantially to faster development, and better tools for proteomics.
Although data deposition is not yet general practice in the field of proteomics, several mass spectrometry (MS)-based proteomics repositories are publicly available for the scientific community. The main existing resources are: the Global Proteome Machine Database (GPMDB), PeptideAtlas, the PRoteomics IDEntifications database (PRIDE), Tranche, and NCBI Peptidome. In this review, the capabilities of each of these resources will be described, paying special attention to four key properties: data types stored, applicable data submission strategies, supported formats, and available data mining and visualization tools. Additionally, the data contents from model organisms will be enumerated for each resource. There are other valuable smaller and/or more specialized repositories, but they will not be covered in this review. Finally, the concept behind the ProteomeXchange consortium, a collaborative effort among the main resources in the field, will be introduced.
CV, Controlled Vocabulary; HGNC, HUGO Gene Nomenclature Committee; MCP, Molecular and Cellular Proteomics; MRM, Multiple Reaction Monitoring; NIH, National Institutes of Health; OLS, Ontology Lookup Service; PICR, Protein Identifier Cross-Referencing; PSI, Proteomics Standards Initiative; QC, Quality Control; SRM, Selected Reaction Monitoring; SBEAMS, Systems Biology Experiment Analysis Management System; TPP, Trans-Proteomic Pipeline; Proteomics; Databases; Bioinformatics; Data standards; Repositories