Confident identification of peptides via tandem mass spectrometry underpins modern high-throughput proteomics. This has motivated considerable recent interest in the post-processing of search engine results to increase confidence and calculate robust statistical measures, for example through the use of decoy databases to calculate false discovery rates (FDRs). FDR-based analyses account for multiple testing and can assign a single confidence value both to sets of peptide spectrum matches (PSMs) and to individual PSMs. We recently developed an algorithm for combining the results from multiple search engines, integrating FDRs for sets of PSMs made by different search engine combinations. Here we describe a web server, and a downloadable application, which make this routinely available to the proteomics community. The web server offers a range of outputs, including informative graphics for assessing the confidence of the PSMs and any potential biases. The underlying pipeline provides a basic protein inference step, integrating PSMs into protein ambiguity groups where peptides can be matched to more than one protein. Importantly, we have also implemented full support for the mzIdentML data standard, recently released by the Proteomics Standards Initiative, giving users the ability to convert native formats to mzIdentML files, which are available to download.
bioinformatics; false discovery rate; multiple search engines; web server; data standards
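The decoy-database FDR estimation described above can be sketched in a few lines of Python. This is a minimal illustration of the target-decoy idea only, not the pipeline's actual implementation; the score list and labels are hypothetical.

```python
def estimate_fdr(psms):
    """Estimate q-values for PSMs via the target-decoy approach.

    psms: list of (score, is_decoy) tuples; higher score = better match.
    Returns a list of (score, is_decoy, q_value) sorted by descending score.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = decoys = 0
    fdrs = []
    for _score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # FDR at this threshold: estimated false positives / accepted targets
        fdrs.append(decoys / max(targets, 1))
    # Convert FDRs to q-values (running minimum from the worst score upward)
    q, qvals = 1.0, []
    for fdr in reversed(fdrs):
        q = min(q, fdr)
        qvals.append(q)
    qvals.reverse()
    return [(s, d, q) for (s, d), q in zip(ranked, qvals)]
```

For example, `estimate_fdr([(9.1, False), (8.7, False), (7.2, True), (6.5, False)])` assigns a q-value of 0 to the two top-scoring targets, since no decoy outranks them.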
The killer cell immunoglobulin-like receptors (KIR) play a fundamental role in the innate immune system, through their interactions with human leucocyte antigen (HLA) molecules, leading to the modulation of activity in natural killer (NK) cells, mainly related to killing pathogen-infected cells. KIR genes are hugely polymorphic both in the number of genes an individual carries and in the number of alleles identified. We have previously developed the Allele Frequency Net Database (AFND, http://www.allelefrequencies.net), which captures worldwide frequencies of alleles, genes and haplotypes for several immune genes, including KIR genes, in healthy populations, covering >4 million individuals. Here, we report the creation of a new database within AFND, named KIR and Diseases Database (KDDB), capturing a large quantity of data derived from publications in which KIR genes, alleles, genotypes and/or haplotypes have been associated with infectious diseases (e.g. hepatitis C, HIV, malaria), autoimmune disorders (e.g. type I diabetes, rheumatoid arthritis), cancer and pregnancy-related complications. KDDB has been created through an extensive manual curation effort, extracting data on more than a thousand KIR-disease records, comprising >50 000 individuals. KDDB thus provides a new community resource for understanding not only how KIR genes are associated with disease, but also, by working in tandem with the large data sets already present in AFND, where particular genes, genotypes or haplotypes are present in worldwide populations or different ethnic groups. We anticipate that KDDB will be an important resource for researchers working in immunogenetics.
Controlled vocabularies (CVs), i.e. collections of predefined terms describing a modeling domain that are used for the semantic annotation of data, and ontologies are used in structured data formats and databases to avoid inconsistencies in annotation, to provide unique (and preferably short) accession numbers and to give researchers and computer algorithms the possibility of more expressive semantic annotation of data. The Human Proteome Organization (HUPO)–Proteomics Standards Initiative (PSI) makes extensive use of ontologies/CVs in their data formats. The PSI-Mass Spectrometry (MS) CV contains all the terms used in the PSI MS–related data standards. The CV contains a logical hierarchical structure to ensure ease of maintenance and the development of software that makes use of complex semantics. The CV contains terms required for a complete description of an MS analysis pipeline used in proteomics, including sample labeling, digestion enzymes, instrumentation parts and parameters, software used for identification and quantification of peptides/proteins and the parameters and scores used to determine their significance. Owing to the range of topics covered by the CV, collaborative development across several PSI working groups, including proteomics research groups, instrument manufacturers and software vendors, was necessary. In this article, we describe the overall structure of the CV, the process by which it has been developed and is maintained and the dependencies on other ontologies.
Database URL: http://psidev.cvs.sourceforge.net/viewvc/psidev/psi/psi-ms/mzML/controlledVocabulary/psi-ms.obo
The Library of Apicomplexan Metabolic Pathways (LAMP, http://www.llamp.net) is a web database that provides near-complete mapping from genes to the central metabolic functions for some of the prominent intracellular parasites of the phylum Apicomplexa. This phylum includes the causative agents of malaria, toxoplasmosis and theileriosis—diseases with a huge economic and social impact. A number of apicomplexan genomes have been sequenced, but the accurate annotation of gene function remains challenging. We have adopted an approach called metabolic reconstruction, in which genes are systematically assigned to functions within pathways/networks for Toxoplasma gondii, Neospora caninum, Cryptosporidium and Theileria species, and Babesia bovis. Several functions missing from pathways have been identified, where the corresponding gene for an essential process appears to be absent from the current genome annotation. For each species, LAMP contains interactive diagrams of each pathway, hyperlinked to external resources and annotated with detailed information, including the sources of evidence used. We have also developed a section to highlight the overall metabolic capabilities of each species, such as the ability to synthesize a particular metabolite or the dependence on the host for it. We expect this new database will become a valuable resource for fundamental and applied research on the Apicomplexa.
New methods for performing quantitative proteome analyses based on differential labeling protocols or label-free techniques are reported in the literature on an almost monthly basis. In parallel, a correspondingly vast number of software tools for the analysis of quantitative proteomics data has also been described in the literature and produced by private companies. In this article we review some of the most popular techniques in the field and present a critical appraisal of several software packages available to process and analyze the data produced. We also describe the importance of community standards to support the wide range of software, which may assist researchers in the analysis of data using different platforms and protocols. It is intended that this review will serve bench scientists both as a useful reference and as a guide to the selection and use of different pipelines for quantitative proteomics data analysis. We have produced a web-based tool (http://www.proteosuite.org/?q=other_resources) to help researchers find appropriate software for their local instrumentation, available file formats, and quantitative methodology.
Numerous software packages exist to provide support for quantifying peptides and proteins from mass spectrometry (MS) data. However, many support only a subset of experimental methods or instrument types, meaning that laboratories often have to use multiple software packages. The Progenesis LC-MS software package from Nonlinear Dynamics is a software solution for label-free quantitation. However, many laboratories using Progenesis also wish to employ stable isotope-based methods that are not natively supported in Progenesis. We have developed a Java programming interface that can use the output files produced by Progenesis, allowing the basic MS features quantified across replicates to be used in a range of different experimental methods. We have developed post-processing software (the Progenesis Post-Processor) to embed Progenesis in the analysis of stable isotope labeling data and top3 pseudo-absolute quantitation. We have also created export capability for the new data standard, mzQuantML, produced by the Proteomics Standards Initiative, to facilitate the development and standardization process. The software is provided to users with a simple graphical user interface for accessing the different features. The underlying programming interface may also be used by Java developers to develop other routines for analyzing data produced by Progenesis.
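The top3 pseudo-absolute quantitation mentioned above can be caricatured in a few lines: a protein's abundance is summarized as the mean intensity of its three most intense peptides. This sketch assumes peptide intensities have already been extracted; the flat list of intensities is a hypothetical stand-in for the Progenesis feature output, and the real Post-Processor is written in Java rather than Python.

```python
def top3_abundance(peptide_intensities):
    """Top3 pseudo-absolute quantitation: the mean of the three most
    intense peptide signals observed for a protein (fewer if fewer exist)."""
    top = sorted(peptide_intensities, reverse=True)[:3]
    return sum(top) / len(top) if top else 0.0
```

For example, `top3_abundance([5.0, 1.0, 3.0, 4.0])` averages the 5.0, 4.0 and 3.0 signals, giving 4.0.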
We report the release of mzIdentML, an exchange standard for peptide and protein identification data, designed by the Proteomics Standards Initiative. The format was developed by the Proteomics Standards Initiative in collaboration with instrument and software vendors, and the developers of the major open-source projects in proteomics. Software implementations have been developed to enable conversion from most popular proprietary and open-source formats, and mzIdentML will soon be supported by the major public repositories. These developments enable proteomics scientists to start working with the standard for exchanging and publishing data sets in support of publications and they provide a stable platform for bioinformatics groups and commercial software vendors to work with a single file format for identification data.
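As a rough illustration of what consuming the standard looks like, the sketch below pulls peptide-spectrum matches out of an mzIdentML file using only Python's standard library. The element and attribute names (`SpectrumIdentificationResult`, `SpectrumIdentificationItem`, `spectrumID`, `peptide_ref`, `rank`) follow the mzIdentML schema, but this is a minimal reader for illustration; production software should use a dedicated mzIdentML library.

```python
import xml.etree.ElementTree as ET

def local(tag):
    """Strip the XML namespace prefix, leaving the local element name."""
    return tag.rsplit('}', 1)[-1]

def list_psms(path):
    """Yield (spectrumID, peptide_ref, rank) for each identification item."""
    for _event, elem in ET.iterparse(path):
        if local(elem.tag) == 'SpectrumIdentificationResult':
            spectrum_id = elem.get('spectrumID')
            for item in elem:
                if local(item.tag) == 'SpectrumIdentificationItem':
                    yield spectrum_id, item.get('peptide_ref'), int(item.get('rank', 0))
            elem.clear()  # release parsed elements on large files
```

Using `iterparse` rather than loading the whole tree keeps memory use bounded, which matters because identification files for large experiments can run to gigabytes.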
The Human Proteome Organisation’s Proteomics Standards Initiative (HUPO-PSI) has developed the GelML data exchange format for representing gel electrophoresis experiments performed in proteomics investigations. The format closely follows the reporting guidelines for gel electrophoresis, which are part of the Minimum Information About a Proteomics Experiment (MIAPE) set of modules. GelML supports the capture of metadata (such as experimental protocols) and data (such as gel images) resulting from gel electrophoresis so that laboratories can be compliant with the MIAPE Gel Electrophoresis guidelines, while allowing such data sets to be exchanged or downloaded from public repositories. The format is sufficiently flexible to capture data from a broad range of experimental processes, and complements other PSI formats for mass spectrometry data and the results of protein and peptide identifications to capture entire gel-based proteome workflows. GelML has resulted from the open standardisation process of PSI consisting of both public consultation and anonymous review of the specifications.
data standard; gel electrophoresis; database; ontology
Proteomic techniques allow researchers to perform detailed analyses of cellular states and many studies are published each year, which highlight large numbers of proteins quantified in different samples. However, currently few data sets make it into public databases with sufficient metadata to allow other groups to verify findings, perform data mining or integrate different data sets. The Proteomics Standards Initiative has released a series of "Minimum Information About a Proteomics Experiment" guideline documents (MIAPE modules) and accompanying data exchange formats. This article focuses on proteomic studies based on gel electrophoresis and demonstrates how the corresponding MIAPE modules can be fulfilled and data deposited in public databases, using a new experimental data set as an example.
We have performed a study of the effects of an anabolic agent (salbutamol) at two different time points on the protein complement of rat skeletal muscle cells, quantified by difference gel electrophoresis. In the DIGE study, a total of 31 non-redundant proteins were identified as being potentially modulated at 24 h post-treatment and 110 non-redundant proteins at 96 h post-treatment. Several categories of function have been highlighted as strongly enriched, providing candidate proteins for further study. We also use the study as an example of best practice for data deposition.
We have deposited all data sets from this study in public databases for further analysis by the community. We also describe more generally how gel-based protein identification data sets can now be deposited in the PRoteomics IDEntifications database (PRIDE), using a new software tool, the PRIDESpotMapper, which we developed to work in conjunction with the PRIDE Converter application. We also demonstrate how the ProteoRed MIAPE generator tool can be used to create and share a complete and compliant set of MIAPE reports for this experiment and others.
The allele frequency net database (http://www.allelefrequencies.net) is an online repository that contains information on the frequencies of immune genes and their corresponding alleles in different populations. The extensive variability observed in genes and alleles related to the immune system response and its significance in transplantation, disease association studies and diversity in populations led to the development of this electronic resource. At present, the system contains data from 1133 populations in 608 813 individuals on the frequency of genes from different polymorphic regions such as human leukocyte antigens, killer-cell immunoglobulin-like receptors, major histocompatibility complex Class I chain-related genes and a number of cytokine gene polymorphisms. The project was designed to create a central source for the storage of frequency data and provide individuals with a set of bioinformatics tools to analyze the occurrence of these variants in worldwide populations. The resource has been used in a wide variety of contexts, including clinical applications (histocompatibility, immunology, epidemiology and pharmacogenetics) and population genetics. Demographic information, frequency data and searching tools can be freely accessed through the website.
Tandem mass spectrometry, run in combination with liquid chromatography (LC-MS/MS), can generate large numbers of peptide and protein identifications, for which a variety of database search engines are available. Distinguishing correct identifications from false positives is far from trivial because all data sets are noisy and tend to be too large for manual inspection; therefore, probabilistic methods must be employed to balance the trade-off between sensitivity and specificity. Decoy databases are becoming widely used to place statistical confidence in result sets, allowing the false discovery rate (FDR) to be estimated. It has previously been demonstrated that different MS search engines produce different peptide identification sets, and as such, employing more than one search engine could result in an increased number of peptides being identified. However, such efforts are hindered by the lack of a single scoring framework employed by all search engines.
We have developed a search engine independent scoring framework based on FDR which allows peptide identifications from different search engines to be combined, called the FDRScore. We observe that peptide identifications made by three search engines are infrequently false positives, and identifications made by only a single search engine, even with a strong score from the source search engine, are significantly more likely to be false positives. We have developed a second score based on the FDR within peptide identifications grouped according to the set of search engines that have made the identification, called the combined FDRScore. We demonstrate by searching large publicly available data sets that the combined FDRScore can differentiate between correct and incorrect peptide identifications with high accuracy, allowing on average 35% more peptide identifications to be made at a fixed FDR than using a single search engine.
proteomics; mass spectrometry; decoy database; search engine; scoring; false discovery rate
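The grouping idea behind the combined FDRScore can be sketched as follows: partition PSMs by the set of search engines that agree on them, then estimate an FDR within each group by target-decoy counting, so that three-engine agreement and single-engine identifications are judged against their own error rates. This is a simplified illustration of the grouping step only, not the published FDRScore formula; the input tuples are hypothetical.

```python
from collections import defaultdict

def combined_fdr(psms):
    """Group PSMs by the set of engines that identified them, then
    estimate an FDR within each group via target-decoy counting.

    psms: list of (psm_id, frozenset_of_engines, score, is_decoy).
    Returns {psm_id: group-wise FDR at that PSM's score threshold}.
    """
    groups = defaultdict(list)
    for psm_id, engines, score, is_decoy in psms:
        groups[engines].append((score, psm_id, is_decoy))
    out = {}
    for members in groups.values():
        members.sort(reverse=True)  # best score first within the group
        targets = decoys = 0
        for _score, psm_id, is_decoy in members:
            decoys += is_decoy
            targets += not is_decoy
            out[psm_id] = decoys / max(targets, 1)
    return out
```

Because each group carries its own decoy statistics, a mediocre score backed by three engines can outrank a strong score from a single engine, which is the behavior the abstract describes.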
Flow cytometry technology is widely used in both health care and research. The rapid expansion of flow cytometry applications has outpaced the development of data storage and analysis tools. Collaborative efforts being taken to eliminate this gap include building common vocabularies and ontologies, designing generic data models, and defining data exchange formats. The Minimum Information about a Flow Cytometry Experiment (MIFlowCyt) standard was recently adopted by the International Society for Advancement of Cytometry. This standard guides researchers on the information that should be included in peer-reviewed publications, but it is insufficient for data exchange and integration between computational systems. The Functional Genomics Experiment (FuGE) formalizes common aspects of comprehensive and high-throughput experiments across different biological technologies. We have extended the FuGE object model to accommodate flow cytometry data and metadata.
We used the MagicDraw modelling tool to design a UML model (Flow-OM) according to the FuGE extension guidelines and the AndroMDA toolkit to transform the model to a markup language (Flow-ML). We mapped each MIFlowCyt term to either an existing FuGE class or to a new FuGEFlow class. The development environment was validated by comparing the official FuGE XSD to the schema we generated from the FuGE object model using our configuration. After the Flow-OM model was completed, the final version of the Flow-ML was generated and validated against an example MIFlowCyt-compliant experiment description.
The extension of FuGE for flow cytometry has resulted in a generic FuGE-compliant data model (FuGEFlow), which accommodates and links together all information required by MIFlowCyt. The FuGEFlow model can be used to build software and databases using FuGE software toolkits to facilitate automated exchange and manipulation of potentially large flow cytometry experimental data sets. Additional project documentation, including reusable design patterns and a guide for setting up a development environment, was contributed back to the FuGE project.
We have shown that an extension of FuGE can be used to transform minimum information requirements in natural language to markup language in XML. Extending FuGE required significant effort, but in our experience the benefits outweighed the costs. The FuGEFlow is expected to play a central role in describing flow cytometry experiments and ultimately facilitating data exchange, including with the public flow cytometry repositories currently under development.
A proteomics analysis identifies one third of the predicted Toxoplasma gondii proteins and integrates proteomics and genomics data to refine genome annotation.
Although the genomes of many of the most important human and animal pathogens have now been sequenced, our understanding of the actual proteins expressed by these genomes, and how well the genomes predict protein sequence and expression, is still deficient. We have used three complementary approaches (two-dimensional electrophoresis, gel-liquid chromatography linked tandem mass spectrometry and MudPIT) to analyze the proteome of Toxoplasma gondii, a parasite of medical and veterinary significance, and have developed a public repository for these data within ToxoDB, making proteomics data, for the first time, an integral part of this key genome resource.
The draft genome for Toxoplasma predicts around 8,000 genes with varying degrees of confidence. Our data demonstrate how proteomics can inform these predictions and help discover new genes. We have identified nearly one-third (2,252) of all the predicted proteins, with 2,477 intron-spanning peptides providing supporting evidence for correct splice site annotation. Functional predictions for each protein and key pathways were determined from the proteome. Importantly, we show evidence for many proteins that match alternative gene models, or previously unpredicted genes. For example, approximately 15% of peptides matched more convincingly to alternative gene models. We also compared our data with existing transcriptional data in which we highlight apparent discrepancies between gene transcription and protein expression.
Our data demonstrate the importance of protein data in expression profiling experiments and highlight the necessity of integrating proteomic with genomic data so that iterative refinements of both annotation and expression models are possible.
Whilst the array of techniques available for quantitative proteomics continues to grow, the attendant bioinformatic software tools are similarly expanding in number. The capture and analysis of such quantitative data are crucial to the experiment, and the methods used to process the data will critically affect the quality of the results obtained. These tools must deal with a variety of issues, including identification of labelled and unlabelled peptide species, location of the corresponding mass spectrometry scans in the experiment, construction of representative ion chromatograms, location of the true peptide ion chromatogram start and end, elimination of background signal in the mass spectrum and chromatogram, and calculation of both peptide and protein ratios/abundances. A variety of tools and approaches are available, in part restricted by the nature of the experiment to be performed and the available instrumentation. Currently, although there is no single consensus on precisely how to calculate protein and peptide abundances, many common themes have emerged which identify and reduce many of the key sources of error. These issues will be discussed, along with those relating to the deposition of quantitative data. At present, mature data standards for quantitative proteomics are not yet available, although formats are beginning to emerge.
proteomics; relative quantitation; absolute quantitation; software; bioinformatics; data standards
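The ratio-calculation steps listed above (summing a peptide's ion chromatogram, subtracting background, and rolling peptide ratios up to a protein ratio) can be caricatured in a few lines. The flat intensity lists stand in for real extracted ion chromatograms, and the constant background subtraction is a deliberate simplification of the noise-elimination step.

```python
def peptide_ratio(light_xic, heavy_xic, background=0.0):
    """Ratio of summed, background-subtracted ion-chromatogram areas
    for the light and heavy forms of one peptide."""
    light = sum(max(i - background, 0.0) for i in light_xic)
    heavy = sum(max(i - background, 0.0) for i in heavy_xic)
    return light / heavy if heavy else float('inf')

def protein_ratio(peptide_ratios):
    """Median of the peptide ratios, a common robust per-protein summary
    that is less sensitive to a single mis-quantified peptide than a mean."""
    r = sorted(peptide_ratios)
    n = len(r)
    return r[n // 2] if n % 2 else (r[n // 2 - 1] + r[n // 2]) / 2
```

The choice of median over mean at the protein level is one of the "common themes" the text alludes to: it suppresses outlier peptides caused by interference or mis-assignment.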
Proteomics continues to play a critical role in post-genomic science as continued advances in mass spectrometry and analytical chemistry support the separation and identification of increasing numbers of peptides and proteins from their characteristic mass spectra. In order to facilitate the sharing of these data, various standard formats have been, and continue to be, developed. However, these standards are not yet fully mature and cannot yet cope with the increasing number of quantitative proteomic technologies being developed.
We propose an extension to the PRIDE and mzData XML schemas to accommodate the concept of multiple samples per experiment and, in addition, to capture the intensities of the iTRAQ™ reporter ions in the entry. A simple Java client has been developed to capture and convert the raw data from common spectral file formats into a valid PRIDE XML entry; it also uses a third-party open source tool to generate iTRAQ™ reporter intensities from Mascot output.
We describe an extension to the PRIDE and mzData schemas to enable the capture of quantitative data. Currently this is limited to iTRAQ™ data but is readily extensible for other quantitative proteomic technologies. Furthermore, a software tool has been developed which enables conversion from various mass spectrum file formats and corresponding Mascot peptide identifications to PRIDE-formatted XML. The tool represents a simple approach to preparing quantitative and qualitative data for submission to repositories such as PRIDE, which is necessary to facilitate data deposition and sharing in public domain databases. The software is freely available from .
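The reporter-ion capture described above can be sketched as extracting, for each nominal reporter m/z, the most intense MS/MS peak within a small tolerance. The 114–117 channels are the standard iTRAQ 4-plex reporters, but the nominal masses, tolerance and spectrum layout used here are illustrative rather than taken from the tool itself.

```python
ITRAQ_4PLEX = (114.1, 115.1, 116.1, 117.1)  # nominal 4-plex reporter m/z values

def reporter_intensities(peaks, channels=ITRAQ_4PLEX, tol=0.2):
    """Pick the most intense peak within `tol` Th of each reporter m/z.

    peaks: list of (mz, intensity) pairs from one MS/MS spectrum.
    Returns one intensity per channel (0.0 if no peak falls in the window).
    """
    out = []
    for target in channels:
        near = [i for mz, i in peaks if abs(mz - target) <= tol]
        out.append(max(near) if near else 0.0)
    return out
```

Ratios between channels (e.g. 115/114) then quantify the relative abundance of a peptide across the multiplexed samples, which is the quantitative information the schema extension is designed to store alongside the identification.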
Several data formats have been developed for large scale biological experiments, using a variety of methodologies. Most data formats contain a mechanism for allowing extensions to encode unanticipated data types. Extensions to data formats are important because the experimental methodologies tend to be fairly diverse and rapidly evolving, which hinders the creation of formats that will be stable over time.
In this paper we review the data formats that exist in functional genomics, some of which have become de facto or de jure standards, with a particular focus on how each domain has been modelled, and how each format allows extensions. We describe the tasks that are frequently performed over data formats and analyse how well each task is supported by a particular modelling structure.
From our analysis, we make recommendations as to the types of modelling structure that are most suitable for particular types of experimental annotation. There are several standards currently under development that we believe could benefit from systematically following a set of guidelines.
Drug-induced liver injury (DILI) is one of the most common adverse reactions leading to product withdrawal post-marketing. Recently, genome-wide association studies have identified a number of human leukocyte antigen (HLA) alleles associated with DILI; however, the cellular and chemical mechanisms are not fully understood.
To study these mechanisms, we established an HLA-typed cell archive from 400 healthy volunteers. In addition, we utilized HLA genotype data from more than four million individuals from publicly accessible repositories such as the Allele Frequency Net Database, Major Histocompatibility Complex Database and Immune Epitope Database to study the HLA alleles associated with DILI. We utilized novel in silico strategies to examine HLA haplotype relationships among the alleles associated with DILI by using bioinformatics tools such as NetMHCpan, PyPop, GraphViz, PHYLIP and TreeView.
We demonstrated that many of the alleles that have been associated with liver injury induced by structurally diverse drugs (flucloxacillin, co-amoxiclav, ximelagatran, lapatinib, lumiracoxib) reside on common HLA haplotypes, which were present in populations of diverse ethnicity.
Our bioinformatic analysis indicates that there may be a connection between the different HLA alleles associated with DILI caused by therapeutically and structurally different drugs, possibly through peptide binding of one of the HLA alleles that defines the causal haplotype. Further functional work, together with next-generation sequencing techniques, will be needed to define the causal alleles associated with DILI.
XGAP, a software platform for the integration and analysis of genotype and phenotype data.
We present an extensible software model for the genotype and phenotype community, XGAP. Readers can download a standard XGAP (http://www.xgap.org) or auto-generate a custom version using MOLGENIS with programming interfaces to R-software and web-services or user interfaces for biologists. XGAP has simple load formats for any type of genotype, epigenotype, transcript, protein, metabolite or other phenotype data. Current functionality includes tools ranging from eQTL analysis in mouse to genome-wide association studies in humans.